-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
trainning-operator may need to monitor PodGroup #1574
Comments
Hi @qiankunli . I don't know if we are on the same page, but I think the base job controller has already tried to reconcile podGroup https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L256 |
@cheimu I mean, if the state of podgroup changed from pending to inqueue, the reconcile should be triggered. when podgroup is pending, https://github.com/kubeflow/common/blob/master/pkg/controller.v1/common/job.go#L256 break the reconcile. when podgroup is inqueue, the reconcile can not be triggered, so the ReconcilePods and ReconcileServices can not be reached. |
It is true that PodGroup is not watched to trigger reconciliation but the controller does resync after certain period. You may change to a short time for the resynchronization. |
I do not think the resync works here since podgroup is per job and should be watched in the controller. |
I create a pytorchjob, then pytorch-controller create a PodGroup. The PodGroup was pending state at the beginning, so no pod was created for the pytorchjob. After the PodGroup became the inqueue state, It seems that the controller is not listening to the change of Podgroup, so that the reconcile logic could not be triggered, and the job had no pod.
The text was updated successfully, but these errors were encountered: