Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement webhook validation for the TFJob #2051

Merged

Conversation

tenzen-y
Copy link
Member

What this PR does / why we need it:
I implemented webhook validations for the TFJob.
Additionally, I didn't add any additional validations. The traininig-operator has the same validations the same as before.

Although we potentially commonize podTemplate validations, I leave such commonize in the follow-ups.

This PR depends on #2035

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part-of #1993

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tenzen-y
Copy link
Member Author

/hold
for review

@tenzen-y
Copy link
Member Author

/assign @andreyvelich @johnugeorge

This is the next PR to introduce webhook validations. PTAL, thanks.

@johnugeorge
Copy link
Member

Some tests are timing out

@tenzen-y
Copy link
Member Author

Some tests are timing out

I guess that some tests reach GHA limitations...

@tenzen-y
Copy link
Member Author

Some tests are timing out

I guess that some tests reach GHA limitations...

This error seems to be related it: docker/buildx#841

@johnugeorge
Copy link
Member

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label Apr 10, 2024
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @tenzen-y!
A few comments.

Comment on lines +126 to +129
func IsChiefOrMaster(typ ReplicaType) bool {
return typ == TFJobReplicaTypeChief || typ == TFJobReplicaTypeMaster
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Does it make sense to move this function in webhooks/tensorflow/tfjob_webhook.go since we use it there and it will help us to avoid creation of additional file named tensorflow_types_test.go?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I'm wondering if we can put this function in the webhook package.
However, I decided to put this here since this function is not only for the webhooks like this: https://github.com/search?q=repo%3Akubeflow%2Ftraining-operator%20IsChiefOrMaster&type=code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y Do we really need this check in the controller if we already verify if TFJob contains Chief or Master here: https://github.com/kubeflow/training-operator/blob/c20422067e3ef81df39d03c6f285353344d8f77d/pkg/controller.v1/tensorflow/tfjob_controller.go#L452C6-L452C31 ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need it since the webhook verifies only if the master/chief role replica is 0 or 1.
When TFJob doesn't have any master/chief role, the training operator switches there based on the check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, since we need additional check to verify if rtype is Chief of Master

Comment on lines +73 to +75
if errors := apimachineryvalidation.NameIsDNS1035Label(job.Name, false); len(errors) != 0 {
allErrs = append(allErrs, field.Invalid(field.NewPath("metadata").Child("name"), job.Name, fmt.Sprintf("should match: %v", strings.Join(errors, ","))))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this validation to the helper function in /webhooks/utils/utils.go since this check is common to all Framework ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can commonize these functions, but as I mentioned in the description, commonizing in a separate PR would be better. Because the minimum validation change could prove that webhook validations would not bring any bugs/breaks.

Although we potentially commonize podTemplate validations, I leave such commonize in the follow-ups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that sounds good @tenzen-y!
We might need to track it in the separate issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created: #2054

Comment on lines +104 to +111
for idx, container := range rSpec.Template.Spec.Containers {
if container.Image == "" {
allErrs = append(allErrs, field.Required(containerPath.Index(idx).Child("image"), "must be required"))
}
if container.Name == trainingoperator.TFJobDefaultContainerName {
defaultContainerPresent = true
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above comment maybe that could be moved to the helper function in which we pass container name for the check.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tenzen-y!
/lgtm

Comment on lines +126 to +129
func IsChiefOrMaster(typ ReplicaType) bool {
return typ == TFJobReplicaTypeChief || typ == TFJobReplicaTypeMaster
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, since we need additional check to verify if rtype is Chief of Master

@tenzen-y
Copy link
Member Author

Thanks all!

/hold cancel

@google-oss-prow google-oss-prow bot merged commit 1a73cf1 into kubeflow:master Apr 10, 2024
73 of 105 checks passed
@tenzen-y tenzen-y deleted the implement-tfjob-webhook-validations branch April 10, 2024 20:35
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
johnugeorge pushed a commit to johnugeorge/training-operator that referenced this pull request Apr 28, 2024
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants