training-operator may need to monitor PodGroup #1574

qiankunli · 2022-04-16T10:35:56Z

I created a PyTorchJob, and the pytorch-controller created a PodGroup for it. The PodGroup was initially in the pending state, so no pods were created for the PyTorchJob. After the PodGroup moved to the inqueue state, the controller did not appear to be watching PodGroup changes, so the reconcile logic was never triggered and the job still had no pods.

cheimu · 2022-04-16T18:23:21Z

Hi @qiankunli. I don't know if we are on the same page, but I think the base job controller already tries to reconcile the PodGroup: https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L256

qiankunli · 2022-04-17T05:00:47Z

@cheimu I mean that when the state of the PodGroup changes from pending to inqueue, a reconcile should be triggered. While the PodGroup is pending, https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L256 returns early from the reconcile. Once the PodGroup is inqueue, nothing triggers a reconcile, so ReconcilePods and ReconcileServices are never reached.

zw0610 · 2022-04-17T12:51:37Z

It is true that PodGroup is not watched to trigger reconciliation, but the controller does resync after a certain period. You may set a shorter resynchronization period.

gaocegege · 2022-04-18T01:28:22Z

I do not think the resync works here, since the PodGroup is per job and should be watched in the controller.
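
To make the behavior discussed above concrete, here is a minimal, self-contained sketch in Go of what "watching the PodGroup" would mean. It is not the training-operator or kubeflow/common code: `PodGroup`, `WorkQueue`, `OnPodGroupUpdate`, and `Reconcile` are hypothetical stand-ins, and the phase names are simply the two states mentioned in this thread. The idea is that an update handler maps a PodGroup phase change back to the owning job's key and enqueues it, so the reconcile that returned early while the PodGroup was pending runs again.

```go
package main

import "sync"

// PodGroupPhase holds the two states mentioned in this thread (illustrative only).
type PodGroupPhase string

const (
	PhasePending PodGroupPhase = "Pending"
	PhaseInqueue PodGroupPhase = "Inqueue"
)

// PodGroup is a hypothetical, stripped-down stand-in for the real PodGroup object.
type PodGroup struct {
	Namespace string
	Name      string
	OwnerJob  string // name of the job that owns this PodGroup
	Phase     PodGroupPhase
}

// WorkQueue is a toy deduplicating queue of job keys ("namespace/name").
type WorkQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	order   []string
}

// NewWorkQueue returns an empty queue.
func NewWorkQueue() *WorkQueue {
	return &WorkQueue{pending: map[string]bool{}}
}

// Add enqueues a key unless it is already waiting to be processed.
func (q *WorkQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Drain returns all queued keys in insertion order and empties the queue.
func (q *WorkQueue) Drain() []string {
	q.mu.Lock()
	defer q.mu.Unlock()
	keys := q.order
	q.order = nil
	q.pending = map[string]bool{}
	return keys
}

// OnPodGroupUpdate is the handler a PodGroup watch would invoke (assumed shape).
// It enqueues the owning job only when the phase actually changed.
func OnPodGroupUpdate(q *WorkQueue, oldPG, newPG PodGroup) {
	if oldPG.Phase == newPG.Phase || newPG.OwnerJob == "" {
		return
	}
	q.Add(newPG.Namespace + "/" + newPG.OwnerJob)
}

// Reconcile mimics the early return described above: while the PodGroup is
// pending, pods and services are not reconciled. It reports whether the
// pod/service reconciliation step would be reached.
func Reconcile(pg PodGroup) bool {
	return pg.Phase != PhasePending
}
```

In this sketch, calling `OnPodGroupUpdate` with a pending-to-inqueue transition puts `namespace/jobName` on the queue, and a subsequent `Reconcile` for that job gets past the early return. Without such a handler, the job is only revisited on the periodic resync.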

johnugeorge · 2023-01-21T19:43:45Z

Closing this issue as the feature is implemented as part of #1724 and #1666.
/close

qiankunli mentioned this issue Apr 17, 2022

support successPolicy and failurePolicy on pytorchjob #1575

Closed

johnugeorge mentioned this issue Nov 2, 2022

Training operator 1.6 Roadmap #1683

Closed

9 tasks

johnugeorge closed this as completed Jan 21, 2023
