Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V1: Introduce the CustomResourceValidationExpressions feature (CEL validation) #1708

Open
tenzen-y opened this issue Dec 23, 2022 · 25 comments

Comments

@tenzen-y
Copy link
Member

/kind feature

Since k8s v1.25, the CustomResourceValidationExpressions feature to validate CRDs using Common Expression Language (CEL) without webhook servers is enabled by default.

Since this feature helps to find CRD validation errors for users, we need to introduce that feature once we stop supporting K8s v1.24.

For example, if the container name of replicaSpec is invalid, for now, we only output that error to controller logs; once we introduce CEL validation, we can return that error to end-users.

// Make sure there has at least one container named "tensorflow".
if numNamedTensorflow == 0 {
msg := fmt.Sprintf("TFJobSpec is not valid: There is no container named %s in %v", TFJobDefaultContainerName, rType)
log.Error(msg)
return fmt.Errorf(msg)
}

@tenzen-y
Copy link
Member Author

/assign

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 2, 2023

Through my investigation, I found that we can not replace all validations with CEL validation due to exceeding cost budget.

For example, we can replace the below validation

if container.Image == "" {
msg := fmt.Sprintf("PyTorchJobSpec is not valid: Image is undefined in the container of %v", rType)
return fmt.Errorf(msg)
}

with the below CEL validation:

// ReplicaSpec is a description of the replica
type ReplicaSpec struct {
...
	// Template is the object that describes the pod that
	// will be created for this replica. RestartPolicy in PodTemplateSpec
	// will be overide by RestartPolicy in ReplicaSpec
	// +kubebuilder:validation:XValidation:rule="has(self.spec.containers) && self.spec.containers.all(c, has(c.image) && size(c.image) > 0)",message=""
	Template v1.PodTemplateSpec `json:"template,omitempty"`
...

Forbidden: contributed to estimated rule cost total exceeding cost limit for entire OpenAPIv3 schema, spec.validation.openAPIV3Schema: Forbidden: x-kubernetes-validations estimated rule cost total for entire OpenAPIv3 schema exceeds budget by factor of more than 100x (try simplifying the rule, or adding maxItems, maxProperties, and maxLength where arrays, maps, and strings are declared)]

So, I think that we need to introduce the webhook validation instead of replacing the current validation with CEL validation.

@kubeflow/wg-training-leads WDYT?

@johnugeorge
Copy link
Member

I remember, we had a discussion regarding webhooks. The arguments were deployment easiness(without web hooks) vs clean validation(with webhooks)

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 3, 2023

The arguments were deployment easiness

@johnugeorge Are they concerned about certs for the webhook?

@johnugeorge
Copy link
Member

Yes. That was the major raod block from what I remember

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 4, 2023

@johnugeorge I think we can remove webhook installation barriers once we introduce cert generation logic similar to katib. WDYT?

Actually, I have experience implementing internal cert generation logic in 2 components (kubeflow/katib, kubernetes-sigs/kueue).

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 4, 2023

@johnugeorge
Copy link
Member

@tenzen-y Shall we move this to the next release?

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 4, 2023

@tenzen-y Shall we move this to the next release?

Which does that mean training-operator v1.7 or v1.8?

@johnugeorge
Copy link
Member

We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 4, 2023

We are cutting first 1.7 RC in few days. I am afraid, we cannot complete testing within time if we want to aim this feature in 1.7. What are you thoughts on moving to 1.8?

It makes sense. We can work on this feature for v1.8. Actually, I don't have enough bandwidth for this feature.

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 4, 2023

Also, we can create another issue for the webhook since this issue aims CEL validation.

@andreyvelich
Copy link
Member

Thank you for creating this @tenzen-y.
I agree that we should postpone the discussion around this.
We need to identify pros and cons of using CEL vs Webhook Validation.
As Johnu mentioned before, adding Webhook as a mandatory component for Training Operator will complicate end-user usage.

@tenzen-y
Copy link
Member Author

tenzen-y commented Aug 4, 2023

Anyway, I will create an issue to discuss webhook since CEL validation isn't enough due to exceeding the cost budge.

#1708 (comment)

@tenzen-y
Copy link
Member Author

/remove-area 1.7.0

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member Author

/remove-lifecycle stale

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member Author

/remove-lifecycle stale

@johnugeorge
Copy link
Member

Moved to 1.9 release

@tenzen-y
Copy link
Member Author

Moved to 1.9 release

I agreed with this decision in the offline meeting with @johnugeorge.

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member Author

/remove-lifecycle stale

Copy link

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Copy link
Member Author

/lifecycle frozen

@tenzen-y tenzen-y changed the title Introduce the CustomResourceValidationExpressions feature (CEL validation) V1: Introduce the CustomResourceValidationExpressions feature (CEL validation) Oct 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants