-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-2170: Implement skeleton webhook servers #2251
KEP-2170: Implement skeleton webhook servers #2251
Conversation
Pull Request Test Coverage Report for Build 10778925774Details
💛 - Coveralls |
/assign @kubeflow/wg-training-leads |
I will delegate the actual validation implementations to other contributors who are interested in this project. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @tenzen-y!
I left a few comments.
manifests/v2/base/service.yaml
Outdated
prometheus.io/port: "8080" | ||
labels: | ||
app: training-operator | ||
name: training-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, we discussed before that we want to allow to deploy V1 and V2 operator, so I suggest we name it as:
name: training-operator | |
name: training-operator-v2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Service was removed based on #2251 (comment) discussion.
manifests/v2/base/service.yaml
Outdated
labels: | ||
app: training-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the appropriate label here:
labels: | |
app: training-operator | |
labels: | |
training.kubeflow.org/component: controller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I make it similar to Katib: https://github.com/kubeflow/katib/blob/master/manifests/v1beta1/components/controller/controller.yaml#L17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I was not supposed to add this file in this PR. I would like to add Service and Deployment manifests in #2208.
So, could we remove Service and Deployment from this PR so that we can focus on setting up skeleton webhook servers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, we can add them in a separate PR.
manifests/v2/base/service.yaml
Outdated
protocol: TCP | ||
targetPort: 9443 | ||
selector: | ||
control-plane: kubeflow-training-operator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
control-plane: kubeflow-training-operator | |
training.kubeflow.org/component: controller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Service was removed based on #2251 (comment) discussion.
apiVersion: admissionregistration.k8s.io/v1 | ||
kind: MutatingWebhookConfiguration | ||
metadata: | ||
name: mutating-webhook-configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering for what are we planning to use Mutating Webhook ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, we need to default .spec.mlPolicy.numNodes to 1, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but can we set these values via CEL or we require Mutating Webhook in any case if we want to add default values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. In that case, let's decide if we should introduce the webhook mutator during #2209.
So, we can remove the mutator for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
apiVersion: admissionregistration.k8s.io/v1 | ||
kind: ValidatingWebhookConfiguration | ||
metadata: | ||
name: validating-webhook-configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we call it similar to Katib webhooks ?
name: validating-webhook-configuration | |
name: training.kubeflow.org |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is auto-generated name by controller-gen.
So, we can not change this name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that we discussed before that we can rename it via Kustomize patch: https://github.com/kubeflow/training-operator/blob/master/manifests/base/webhook/patch.yaml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's right. But we do not add any manifests in this PR because I added only manifests generated by controller-gen so that we can start up the webhook servers during envtest.
In other words, these manifests are at least manifests to launch webhook servers during integration testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I think #2208 has the responsibility to add manifests not generated by controller-tools. Any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will describe the detailed task in #2208.
namespace: system | ||
path: /validate-kubeflow-org-v2alpha1-clustertrainingruntime | ||
failurePolicy: Fail | ||
name: vclustertrainingruntime.training-operator.kubeflow.org |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we simplify it ?
name: vclustertrainingruntime.training-operator.kubeflow.org | |
name: clustertrainingruntime.training.kubeflow.org |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the conflict between the defaulter and the validator, let's use the "validator" prefix.
Additionally, could we use the "training-operaotr.kubeflow.org" domain since we have 2 components, training-operator and mpi-operator for the training?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the conflict between the defaulter and the validator, let's use the "validator" prefix
What do you think about the full names, like we did for Katib webhooks:
validator.clustertrainingruntime.kubeflow.org
defaulter.clustertrainingruntime.kubeflow.org
Additionally, could we use the "training-operaotr.kubeflow.org" domain since we have 2 components, training-operator and mpi-operator for the training?
Should we use mpi.kubeflow.org
for MPI Operator webhooks for now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it could work fine now.
The TainJob and TrainingRuntime are still in the alpha stage. So, it is easy to change the name once we face conflicts with other kubeflow projects.
Indeed, the MPIOperator does not have any webhooks now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
namespace: system | ||
path: /validate-kubeflow-org-v2alpha1-trainingruntime | ||
failurePolicy: Fail | ||
name: vtrainingruntime.training-operator.kubeflow.org |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: vtrainingruntime.training-operator.kubeflow.org | |
name: trainingruntime.training.kubeflow.org |
namespace: system | ||
path: /validate-kubeflow-org-v2alpha1-trainjob | ||
failurePolicy: Fail | ||
name: vtrainjob.training-operator.kubeflow.org |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name: vtrainjob.training-operator.kubeflow.org | |
name: trainjob.training.kubeflow.org |
pkg/webhook.v2/setup.go
Outdated
return "TrainingRuntime", err | ||
} | ||
if err := setupWebhookForTrainJob(mgr); err != nil { | ||
return "TranJob", err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move those Kind types to constants ?
return "TranJob", err | |
return "TrainJob", err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
manifests/v2/base/service.yaml
Outdated
@@ -0,0 +1,22 @@ | |||
apiVersion: v1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to create deployment for controller as well ?
Also, what do you think about moving service and controller under /base/controller
or /base/manager
folder ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Service was removed based on #2251 (comment) discussion.
a74d4db
to
967b68d
Compare
@andreyvelich I addressed the comments. The only remaining point is #2251 (comment). |
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
967b68d
to
3538acd
Compare
@andreyvelich I addressed all comments. PTAL. |
return "ClusterTrainingRuntime", err | ||
} | ||
if err := setupWebhookForTrainingRuntime(mgr); err != nil { | ||
return "TrainingRuntime", err | ||
} | ||
if err := setupWebhookForTrainJob(mgr); err != nil { | ||
return "TrainJob", err | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y Do you return these names only to handle this error: https://github.com/kubeflow/training-operator/blob/master/cmd/training-operator.v2alpha1/main.go#L150C5-L152 ?
Do you think it is necessary ?
If some of the webhook will fail, we will get this info since we traceback error from the underlying file: ClusterTrainingRuntime, TrainingRuntime, or TrainJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe that this name is helpful because this is not for validation error, this is for set up error.
This error can be seen when the startup webhook server failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, but without this name, can we see the file traceback of this error during the webhook setup ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that the traceback will show the path where the error occurs. However, I still believe this helps understand the components in which failure occurs since this component name (ex. TrainJob) can be shown as a top-level log, and this top-level log allows us to avoid digging into nested traceback logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this @tenzen-y!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thank you! |
What this PR does / why we need it:
I implemented the skeleton webhook servers for v2 operator.
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):Part-of #2209
Checklist: