Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-2170: Implement skeleton webhook servers #2251

Merged

Conversation

tenzen-y
Copy link
Member

@tenzen-y tenzen-y commented Sep 6, 2024

What this PR does / why we need it:
I implemented the skeleton webhook servers for v2 operator.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #2209

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Sep 6, 2024

Pull Request Test Coverage Report for Build 10778925774

Details

  • 21 of 21 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 100.0%

Totals Coverage Status
Change from base Build 10739885471: 0.0%
Covered Lines: 66
Relevant Lines: 66

💛 - Coveralls

@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 6, 2024

/assign @kubeflow/wg-training-leads

@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 6, 2024

I will delegate the actual validation implementations to other contributors who are interested in this project.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @tenzen-y!
I left a few comments.

prometheus.io/port: "8080"
labels:
app: training-operator
name: training-operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we discussed before that we want to allow to deploy V1 and V2 operator, so I suggest we name it as:

Suggested change
name: training-operator
name: training-operator-v2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Service was removed based on #2251 (comment) discussion.

Comment on lines 8 to 9
labels:
app: training-operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use the appropriate label here:

Suggested change
labels:
app: training-operator
labels:
training.kubeflow.org/component: controller

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was not supposed to add this file in this PR. I would like to add Service and Deployment manifests in #2208.

So, could we remove Service and Deployment from this PR so that we can focus on setting up skeleton webhook servers?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, we can add them in a separate PR.

protocol: TCP
targetPort: 9443
selector:
control-plane: kubeflow-training-operator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
control-plane: kubeflow-training-operator
training.kubeflow.org/component: controller

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Service was removed based on #2251 (comment) discussion.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
name: mutating-webhook-configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering for what are we planning to use Mutating Webhook ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we need to default .spec.mlPolicy.numNodes to 1, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but can we set these values via CEL or we require Mutating Webhook in any case if we want to add default values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. In that case, let's decide if we should introduce the webhook mutator during #2209.

So, we can remove the mutator for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
name: validating-webhook-configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we call it similar to Katib webhooks ?

Suggested change
name: validating-webhook-configuration
name: training.kubeflow.org

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is auto-generated name by controller-gen.
So, we can not change this name.

Copy link
Member

@andreyvelich andreyvelich Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that we discussed before that we can rename it via Kustomize patch: https://github.com/kubeflow/training-operator/blob/master/manifests/base/webhook/patch.yaml.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. But we do not add any manifests in this PR because I added only manifests generated by controller-gen so that we can start up the webhook servers during envtest.

In other words, these manifests are at least manifests to launch webhook servers during integration testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think #2208 has the responsibility to add manifests not generated by controller-tools. Any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. @tenzen-y Please can you add the comment in #2208 that we should add the Kustomize patch to update those names ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will describe the detailed task in #2208.

namespace: system
path: /validate-kubeflow-org-v2alpha1-clustertrainingruntime
failurePolicy: Fail
name: vclustertrainingruntime.training-operator.kubeflow.org
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we simplify it ?

Suggested change
name: vclustertrainingruntime.training-operator.kubeflow.org
name: clustertrainingruntime.training.kubeflow.org

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the conflict between the defaulter and the validator, let's use the "validator" prefix.
Additionally, could we use the "training-operaotr.kubeflow.org" domain since we have 2 components, training-operator and mpi-operator for the training?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the conflict between the defaulter and the validator, let's use the "validator" prefix

What do you think about the full names, like we did for Katib webhooks:

validator.clustertrainingruntime.kubeflow.org
defaulter.clustertrainingruntime.kubeflow.org

Additionally, could we use the "training-operaotr.kubeflow.org" domain since we have 2 components, training-operator and mpi-operator for the training?

Should we use mpi.kubeflow.org for MPI Operator webhooks for now ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it could work fine now.

The TainJob and TrainingRuntime are still in the alpha stage. So, it is easy to change the name once we face conflicts with other kubeflow projects.

Indeed, the MPIOperator does not have any webhooks now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

namespace: system
path: /validate-kubeflow-org-v2alpha1-trainingruntime
failurePolicy: Fail
name: vtrainingruntime.training-operator.kubeflow.org
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: vtrainingruntime.training-operator.kubeflow.org
name: trainingruntime.training.kubeflow.org

namespace: system
path: /validate-kubeflow-org-v2alpha1-trainjob
failurePolicy: Fail
name: vtrainjob.training-operator.kubeflow.org
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: vtrainjob.training-operator.kubeflow.org
name: trainjob.training.kubeflow.org

return "TrainingRuntime", err
}
if err := setupWebhookForTrainJob(mgr); err != nil {
return "TranJob", err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move those Kind types to constants ?

Suggested change
return "TranJob", err
return "TrainJob", err

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,22 @@
apiVersion: v1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to create deployment for controller as well ?
Also, what do you think about moving service and controller under /base/controller or /base/manager folder ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Service was removed based on #2251 (comment) discussion.

@tenzen-y tenzen-y force-pushed the implement-sckelton-webhook-servers branch from a74d4db to 967b68d Compare September 9, 2024 16:42
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Sep 9, 2024
@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 9, 2024

@andreyvelich I addressed the comments. The only remaining point is #2251 (comment).

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
@tenzen-y tenzen-y force-pushed the implement-sckelton-webhook-servers branch from 967b68d to 3538acd Compare September 9, 2024 18:18
@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 9, 2024

@andreyvelich I addressed all comments. PTAL.

Comment on lines +23 to +30
return "ClusterTrainingRuntime", err
}
if err := setupWebhookForTrainingRuntime(mgr); err != nil {
return "TrainingRuntime", err
}
if err := setupWebhookForTrainJob(mgr); err != nil {
return "TrainJob", err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Do you return these names only to handle this error: https://github.com/kubeflow/training-operator/blob/master/cmd/training-operator.v2alpha1/main.go#L150C5-L152 ?
Do you think it is necessary ?
If some of the webhook will fail, we will get this info since we traceback error from the underlying file: ClusterTrainingRuntime, TrainingRuntime, or TrainJob.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe that this name is helpful because this is not for validation error, this is for set up error.
This error can be seen when the startup webhook server failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, but without this name, can we see the file traceback of this error during the webhook setup ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that the traceback will show the path where the error occurs. However, I still believe this helps understand the components in which failure occurs since this component name (ex. TrainJob) can be shown as a top-level log, and this top-level log allows us to avoid digging into nested traceback logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, sounds good!

pkg/webhook.v2/trainingruntime_webhook.go Show resolved Hide resolved
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this @tenzen-y!
/lgtm
/approve

Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 7c8d4df into kubeflow:master Sep 9, 2024
39 checks passed
@tenzen-y tenzen-y deleted the implement-sckelton-webhook-servers branch September 9, 2024 21:01
@tenzen-y
Copy link
Member Author

tenzen-y commented Sep 9, 2024

Thank you for this @tenzen-y! /lgtm /approve

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants