Fully consolidate tf-operator to training-operator #1727
Comments
/kind feature |
Related: #1714 |
/assign |
In the 1. Conditions under which the controller restarts failed pods
training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go Lines 802 to 806 in 8a066f9
2. Report metrics to Prometheus when pods are restarted
3. Update Job Condition when the controller restarts failed pods
training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go Lines 812 to 817 in 8a066f9
None |
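To make the three points above concrete, here is a minimal, self-contained Go sketch of a controller step that (1) decides whether to restart a failed pod, (2) bumps a restart metric, and (3) records a restarting condition on the job. It does not reproduce the code at the referenced lines. The types `JobStatus` and `restartedJobs`, the helpers `shouldRestart` and `handleFailedPod`, and the rule that exit codes of 128 and above are retryable are all assumptions made for illustration, and the map stands in for a real Prometheus counter.

```go
package main

import "fmt"

// RestartPolicy is a simplified, hypothetical stand-in for the job restart policy type.
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// JobStatus is a hypothetical, reduced job status that only tracks condition messages.
type JobStatus struct {
	Conditions []string
}

// restartedJobs is an in-memory stand-in for a Prometheus counter, keyed by namespace.
var restartedJobs = map[string]int{}

// isRetryableExitCode encodes an assumed rule: exit codes >= 128 are treated as retryable.
func isRetryableExitCode(exitCode int32) bool {
	return exitCode >= 128
}

// shouldRestart covers point 1: the conditions under which a failed pod is restarted.
func shouldRestart(podFailed bool, policy RestartPolicy, exitCode int32) bool {
	if !podFailed {
		return false
	}
	switch policy {
	case RestartPolicyExitCode:
		return isRetryableExitCode(exitCode)
	case RestartPolicyOnFailure, RestartPolicyAlways:
		return true
	default:
		return false
	}
}

// handleFailedPod covers points 2 and 3: it increments the restart metric and
// appends a restarting condition, but only when shouldRestart says so.
func handleFailedPod(status *JobStatus, namespace, jobName, replicaType string,
	policy RestartPolicy, exitCode int32) bool {
	if !shouldRestart(true, policy, exitCode) {
		return false
	}
	restartedJobs[namespace]++
	msg := fmt.Sprintf("Job %s is restarting because %s replica(s) failed.", jobName, replicaType)
	status.Conditions = append(status.Conditions, msg)
	return true
}

func main() {
	status := &JobStatus{}
	restarted := handleFailedPod(status, "default", "mnist", "Worker", RestartPolicyExitCode, 137)
	fmt.Println("restarted:", restarted)
	fmt.Println("restart count:", restartedJobs["default"])
	fmt.Println("conditions:", status.Conditions)
}
```

Keeping the decision in one function and the side effects in another makes it easier to compare two implementations point by point, which is the comparison being made in this thread.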
I think the TFJob behavior is the correct one on all three points of difference. @kubeflow/wg-training-leads @zw0610 WDYT? |
@kubeflow/wg-training-leads @zw0610 Please let me know what you think. |
@tenzen-y Sorry for the late response, and thanks for pointing this out. Yes, the TFJob implementation is the correct one for all 3 points. What changes do you propose for common? This needs to be fixed, either before or after merging common into the training-operator repo. |
@johnugeorge Thank you for the confirmation.
Yes, we should fix that. I would bring the common repo in line with the TFJob implementation on all 3 points. training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go Lines 802 to 806 in 8a066f9
However, I'm on the fence about whether we should fix that before the merge of common. Fixing it after the merge might be safer. What do you think? |
Sounds good. I have started working on common merge changes. |
Great! Please let me know if you create a PR to merge the common into this repository. I will review the PR. |
Through more work, I found the Regarding |
We must fully consolidate tf-operator into training-operator, since it still uses functions that are duplicated in kubeflow/common.
For example:
training-operator/pkg/controller.v1/tensorflow/tfjob_controller.go
Lines 712 to 716 in ddf372c