
PytorchJob DDP training will stop if I delete a worker pod #1478

Closed
Shuai-Xie opened this issue Nov 22, 2021 · 4 comments

@Shuai-Xie

Hi, everyone.

I want to test the fault tolerance of PytorchJob.

I started a PytorchJob with 1 master and 3 workers.

$ kubectl get pods -o wide
NAME                 READY   STATUS    RESTARTS   AGE     IP           NODE
mnist-ddp-master-0   1/1     Running   0          2m55s   10.10.10.1   11.71.1.160
mnist-ddp-worker-0   1/1     Running   0          2m55s   10.10.10.2   11.71.1.161
mnist-ddp-worker-1   1/1     Running   0          2m55s   10.10.10.3   11.71.1.161
mnist-ddp-worker-2   1/1     Running   0          2m55s   10.10.10.4   11.71.1.162

It trains fine.

Then I deleted a worker.

$ kubectl delete pod mnist-ddp-worker-1

Since I set restartPolicy: OnFailure, the pod restarts quickly with the same name, mnist-ddp-worker-1.
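
For reference, a minimal sketch of the kind of manifest this setup implies (1 Master, 3 Workers, restartPolicy: OnFailure). The job name, image, and command are assumptions; only the replica counts and the restart policy come from this issue:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-ddp                # assumed name, matching the pod prefix above
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch      # default container name expected by the operator
              image: my-registry/mnist-ddp:latest       # hypothetical image
              command: ["python", "/workspace/mnist.py"]  # hypothetical entrypoint
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/mnist-ddp:latest       # hypothetical image
              command: ["python", "/workspace/mnist.py"]  # hypothetical entrypoint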

But sadly, the restarted worker never rejoins the DDP training.

Thanks.

@gaocegege
Member

Are you sure that your training script is fault tolerant?

@Shuai-Xie
Author

No. The training code is almost the same as the official PyTorch mnist demo.
I just want to test if the restarted worker can join the DDP training automatically.

By the way, I noticed that the elasticPolicy feature has been integrated into PytorchJob.
Does that mean we can use the elastic feature in Kubeflow 1.4.0 with training-operator 1.3.0 now?

@gaocegege
Member

> I just want to test if the restarted worker can join the DDP training automatically.

I do not think the official PyTorch demo supports fault tolerance. Maybe you need to dive into the demo.

> Does that mean we can use the elastic feature in Kubeflow 1.4.0 with training-operator 1.3.0 now?

1.4.0 has already been released, but elasticPolicy was introduced on the master branch, so you cannot use it in 1.4.0.
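
For context, a hedged sketch of roughly what an elasticPolicy block looks like in the master-branch API (the field names reflect my understanding of the new schema and the values are assumptions; check the training-operator examples for the exact spec). Note that elastic training also requires the script itself to be elastic-aware, i.e. launched via torch.distributed.run and able to resume from checkpoints, which the plain mnist demo does not do:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: mnist-elastic            # hypothetical name
spec:
  elasticPolicy:
    rdzvBackend: c10d            # rendezvous backend used by torch.distributed.run
    minReplicas: 1               # assumed lower bound on workers
    maxReplicas: 3               # assumed upper bound on workers
    maxRestarts: 10              # assumed restart budget
  pytorchReplicaSpecs:
    Worker:                      # elastic jobs typically define only Worker replicas
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/mnist-elastic:latest   # hypothetical image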

@Shuai-Xie
Author

OK. Many thanks.
