Validate longest generated pod name will not exceed 63 characters #408

Closed
danielvegamyhre opened this issue Feb 13, 2024 · 0 comments · Fixed by #409


danielvegamyhre commented Feb 13, 2024

Problem

In the JobSet webhook, we currently validate only that the longest child Job name that will be generated for a JobSet is DNS-1035 compliant, which includes a max length of 63 characters. The Job controller enforces this constraint, so we reject the JobSet creation request if it would result in attempting to create child Jobs with names longer than 63 characters.
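As a rough illustration of the existing check, a DNS-1035 label must be at most 63 characters, start with a lowercase letter, end with a lowercase letter or digit, and contain only lowercase letters, digits, and '-'. The sketch below is in Python for illustration only (JobSet itself is written in Go), and the function name `dns1035_label_errors` is hypothetical, not the webhook's actual API.

```python
import re

DNS1035_LABEL_MAX_LENGTH = 63
# Starts with a lowercase letter, ends with a lowercase letter or digit,
# and contains only lowercase letters, digits, and '-' in between.
DNS1035_LABEL_RE = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def dns1035_label_errors(name):
    """Return a list of problems with `name` as a DNS-1035 label (empty if valid).

    Hypothetical helper for illustration; not the real webhook code.
    """
    errors = []
    if len(name) > DNS1035_LABEL_MAX_LENGTH:
        errors.append(
            f"must be no more than {DNS1035_LABEL_MAX_LENGTH} characters"
        )
    if not DNS1035_LABEL_RE.fullmatch(name):
        errors.append("must be a valid DNS-1035 label")
    return errors
```

A webhook would only need to run a check like this against the longest child Job name it expects to generate, since every shorter name with the same prefix passes if the longest one does.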

However, we currently have no similar validation logic for pod names, which also have a max length of 63 characters. We previously thought the Job API would return an error if the pod name it is attempting to create for a Job is longer than 63 characters. It turns out that it only validates that the pod hostname has a max length of 63 characters (which does NOT include the 5 character random suffix). When generating the pod name itself, it truncates the job name to make room for the pod index and the 5 character random suffix.
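The truncation behavior described above can be approximated as follows. This is a sketch in Python for illustration, not the Job controller's actual Go code: the function names are hypothetical, the suffix alphabet is simplified, and the only facts taken from the issue are the 63 character limit, the 5 character random suffix, and the fact that the job name is truncated to make room for the pod index and suffix.

```python
import random
import string

MAX_NAME_LENGTH = 63
RANDOM_SUFFIX_LENGTH = 5
# Room left for "<job name>-<pod index>-" once the suffix is reserved.
MAX_GENERATED_NAME_LENGTH = MAX_NAME_LENGTH - RANDOM_SUFFIX_LENGTH


def pod_generate_name(job_name, completion_index):
    """Build the pod name prefix, truncating the job name if needed."""
    index_part = f"-{completion_index}-"
    prefix = job_name + index_part
    if len(prefix) > MAX_GENERATED_NAME_LENGTH:
        # Cut characters off the job name, never off the index.
        keep = MAX_GENERATED_NAME_LENGTH - len(index_part)
        prefix = job_name[:keep] + index_part
    return prefix


def pod_name(job_name, completion_index, rng=random):
    """Append an illustrative 5 character random suffix to the prefix."""
    alphabet = string.ascii_lowercase + string.digits  # simplified alphabet
    suffix = "".join(rng.choice(alphabet) for _ in range(RANDOM_SUFFIX_LENGTH))
    return pod_generate_name(job_name, completion_index) + suffix
```

With a 60 character job name and pod index 0, the prefix keeps only the first 55 characters of the job name, so the resulting pod name is exactly 63 characters but no longer starts with the full job name.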

This led to an issue where a long JobSet name would produce pod names longer than 63 characters, so the Job controller truncates them before creating the pods. That truncation breaks JobSets using JobSet’s exclusive Job per topology feature.

This is because the JobSet pod admission webhook blocks creation of all follower pods (pods with completion index != 0) until the leader pod (completion index 0) for that Job is scheduled. It then admits the pod creation requests and injects nodeSelectors into their specs, which enforce scheduling constraints ensuring the follower pods land in the same topology as their leader pod.

This mechanism depends on the leader pod name following the expected format: [jobset name]-[job index]-[pod index]-[random 5 character suffix]. The webhook maintains an index mapping [pod name without random suffix] to the k8s corev1.Pod object. It can then query the index with the leader pod name without the random suffix to get the leader Pod object, check whether it is scheduled, and apply conditional logic based on the result.

Therefore, since the leader pod name was truncated before creation and no longer follows the expected format, the pod webhook cannot look up the leader pod, and blocks follower pods from creation indefinitely, preventing the JobSet from running properly.
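A minimal sketch of why the lookup fails, assuming the index is a plain mapping keyed by the pod name with its random suffix stripped. It is written in Python for illustration; the helper names (`strip_random_suffix`, `index_pods`, `find_leader`) are hypothetical, and the real webhook stores corev1.Pod objects rather than name strings.

```python
def strip_random_suffix(pod_name):
    """Drop the trailing "-<random suffix>" from a pod name."""
    return pod_name.rsplit("-", 1)[0]


def index_pods(pod_names):
    """Map [pod name without random suffix] -> pod (here, just its name)."""
    return {strip_random_suffix(name): name for name in pod_names}


def find_leader(index, job_name):
    """Look up the leader pod (completion index 0) by its expected key."""
    expected_key = f"{job_name}-0"
    return index.get(expected_key)  # None when the leader is not found


# Short job name: the leader's key matches the expected key.
short_index = index_pods(["js-workers-0-0-abcde"])
leader_for_short_job = find_leader(short_index, "js-workers-0")

# Long job name: the controller truncated the job name inside the pod name,
# so the stored key is shorter than the key the webhook asks for.
long_job_name = "a" * 60
truncated_leader = "a" * 55 + "-0-" + "abcde"
long_index = index_pods([truncated_leader])
leader_for_long_job = find_leader(long_index, long_job_name)
```

In the second case the lookup returns nothing even though the leader pod exists, which is the condition under which follower pods stay blocked.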

Solution

  • Short-term: Add JobSet validation which checks that the longest pod name that will be generated is at most 63 characters. If not, reject the JobSet creation and return a clear, user-facing error message.
  • Long-term: Explore ways to look up the leader pod without depending on the pod name, since this is somewhat brittle and there are no upstream guarantees about the pod suffix length, among other things (see here).
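The short-term check could look roughly like the sketch below. It is in Python for illustration and is not the fix that was merged: the function names and the error wording are hypothetical, and it assumes the pod name is [job name]-[pod index]-[5 character suffix], with the largest pod index being one less than the Job's completion count.

```python
MAX_POD_NAME_LENGTH = 63
RANDOM_SUFFIX_LENGTH = 5


def longest_pod_name_length(longest_job_name, completions):
    """Length of "<job name>-<max pod index>-<random suffix>"."""
    max_index = max(completions - 1, 0)
    return (
        len(longest_job_name)
        + 1  # '-' before the pod index
        + len(str(max_index))
        + 1  # '-' before the random suffix
        + RANDOM_SUFFIX_LENGTH
    )


def validate_longest_pod_name(longest_job_name, completions):
    """Return an error message, or None if the longest pod name fits."""
    length = longest_pod_name_length(longest_job_name, completions)
    if length > MAX_POD_NAME_LENGTH:
        return (
            f"longest generated pod name would be {length} characters, "
            f"which exceeds the limit of {MAX_POD_NAME_LENGTH}; "
            "use a shorter JobSet name"
        )
    return None
```

For example, a 55 character job name with 10 completions yields a 63 character pod name and passes, while the same job name with 11 completions needs a two digit index and is rejected.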