[JobSet v0.3.0 Bug] Webhook failing to start and somehow blocking regular Jobs not owned by JobSets from creating pods #361

Closed · danielvegamyhre opened this issue Dec 21, 2023 · 10 comments · Fixed by #362
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

@danielvegamyhre (Contributor)

This problem doesn't happen consistently; I manually tested JobSet v0.3.0 multiple times successfully. However, a user today is running into a concerning error. After installing v0.3.0, the JobSet controller manager deployment fails to start up, and then, somehow, when they submit a regular Job (not owned by a JobSet), the Job cannot create any pods due to an error calling the JobSet webhook (??).

Steps to reproduce:

  1. Deploy JobSet v0.3.0

The JobSet controller manager is not starting properly, so there are no endpoints for the webhook service:

kubectl describe rs -n jobset-system

...
Events:
  Type     Reason        Age                  From                   Message
  ----     ------        ----                 ----                   -------
  Warning  FailedCreate  8m57s (x47 over 8h)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s": no endpoints available for service "jobset-webhook-service"
  2. Create a standalone Job. Example Job + headless service spec being deployed by the user encountering this issue:
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-job
spec:
  backoffLimit: 0
  # Completions and parallelism should be the number of chips divided by 4.
  # (e.g. 4 for a v5litepod-16)
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      terminationGracePeriodSeconds: 300
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x1
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU usage metrics, if supported
        securityContext:
          privileged: true
        command:
        - bash
        - -c
        - |
          printenv
          pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("JAX Devices:", jax.devices(), "Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4

Describing the Job after deploying it:

kubectl describe job tpu-job


Name:               tpu-job
Namespace:          default
Selector:           batch.kubernetes.io/controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
Labels:             batch.kubernetes.io/controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
                    batch.kubernetes.io/job-name=tpu-job
                    controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
                    job-name=tpu-job
Annotations:        <none>
Parallelism:        4
Completions:        4
Completion Mode:    Indexed
Start Time:         Thu, 21 Dec 2023 17:31:55 +0000
Pods Statuses:      0 Active (0 Ready) / 0 Succeeded / 0 Failed
Completed Indexes:  <none>
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
           batch.kubernetes.io/job-name=tpu-job
           controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
           job-name=tpu-job
  Containers:
   tpu-job:
    Image:       python:3.10
    Ports:       8471/TCP, 8431/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      bash
      -c
      printenv
      pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
      python -c 'import jax; print("JAX Devices:", jax.devices(), "Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
      
    Limits:
      google.com/tpu:  4
    Requests:
      google.com/tpu:  4
    Environment:
      TPU_WORKER_HOSTNAMES:  tpu-job-0.headless-svc,tpu-job-1.headless-svc,tpu-job-2.headless-svc,tpu-job-3.headless-svc
      TPU_WORKER_ID:          (v1:metadata.annotations['batch.kubernetes.io/job-completion-index'])
    Mounts:                  <none>
  Volumes:                   <none>
Events:
  Type     Reason        Age                From            Message
  ----     ------        ----               ----            -------
  Warning  FailedCreate  30s (x6 over 61s)  job-controller  Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s": no endpoints available for service "jobset-webhook-service"
$:~/go/src/sigs.k8s.io/jobset$ k get pods -n jobset-system
@danielvegamyhre (Contributor, Author)

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 21, 2023
@danielvegamyhre danielvegamyhre self-assigned this Dec 21, 2023
@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

I think I see the issue. The mutating pod webhook is selecting all pods, and has no constraints on them being owned by a JobSet. We need to update it to add that constraint.

For(&corev1.Pod{}).

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

Hmm the thing is, the pod webhook has logic which skips pods that aren't owned by a JobSet.

if _, isJobSetPod := pod.Annotations[jobset.JobSetNameKey]; !isJobSetPod {

It's just that when defining a webhook, you have to declare the object type it handles, which in this case is Pod.

So if the JobSet controller manager can't find a node to run on, but every pod being created in the cluster still has to go through that webhook, this can happen?

@danielvegamyhre (Contributor, Author)

I think we may be able to add filters to the webhook via NewWebhookManagedBy(), similar to what we do in the controllers, to avoid this issue. I'll look into it.

@kannon92 (Contributor)

This sounds like the issues @dejanzele and I have been seeing.

I think something is up with the cert-rotator we use with newer versions of kind. I think it's related to the flakiness in the e2e tests.

So sometimes we are not able to deploy the JobSet deployment, and I see failures in kube-controller-manager around the pod webhooks. As you stated, the deployment is stuck and you see failures calling the pod webhook.

In our e2e tests you can see errors like this:

2023-12-21T12:39:30.271766422Z stderr F I1221 12:39:30.271589       1 event.go:376] "Event occurred" object="jobset-system/jobset-controller-manager" fieldPath="" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set jobset-controller-manager-6ff5cc5557 to 1"
2023-12-21T12:39:31.328311043Z stderr F I1221 12:39:31.328106       1 event.go:376] "Event occurred" object="jobset-system/jobset-controller-manager-6ff5cc5557" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Warning" reason="FailedCreate" message="Error creating: Internal error occurred: failed calling webhook \"mpod.kb.io\": failed to call webhook: Post \"https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s\": dial tcp 10.96.98.174:443: connect: connection refused"

See https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/359/pull-jobset-test-e2e-main-1-29/1737813742327959552/artifacts/kind-control-plane/pods/kube-system_kube-controller-manager-kind-control-plane_48fd15112471393d2f455b3c36475666/kube-controller-manager/0.log

@danielvegamyhre (Contributor, Author)

The pod webhook selects all pods; the webhook definition itself has no constraint that they be owned by a JobSet. We only filter out pods not owned by a JobSet in the Go code, at the admission stage of the webhook. So if the webhook is installed but its server is unable to run anywhere, it will block all pods in the cluster from being created.
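
For context, the generated webhook configuration looks roughly like the sketch below. The webhook name, service, path, and timeout come from the error messages above; the configuration name, failurePolicy, rules, and other fields are assumptions based on typical kubebuilder output rather than copied from the repo. The point is that there is no objectSelector, so with failurePolicy: Fail every pod CREATE in the cluster fails while the webhook service has no endpoints:

# Hypothetical sketch of the generated MutatingWebhookConfiguration (not copied from the repo).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: jobset-mutating-webhook-configuration  # assumed name
webhooks:
- name: mpod.kb.io
  admissionReviewVersions: ["v1"]
  clientConfig:
    service:
      name: jobset-webhook-service
      namespace: jobset-system
      path: /mutate--v1-pod
      port: 443
  failurePolicy: Fail      # assumed; with Fail, an unreachable webhook blocks admission
  sideEffects: None
  timeoutSeconds: 10
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"] # assumed; the key point is that the rule matches pods cluster-wide
    resources: ["pods"]
  # no objectSelector, so every pod in the cluster is sent to this webhook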

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

We need to be able to add an objectSelector to the webhook so it only selects pods with certain labels (indicating they are part of a JobSet), but kubebuilder webhook markers do not support this: https://book.kubebuilder.io/reference/markers/webhook
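
For illustration, the field we want in the generated MutatingWebhookConfiguration would look something like this sketch (the label key is an assumption; it would be whatever label JobSet stamps onto the pods it creates):

webhooks:
- name: mpod.kb.io
  objectSelector:          # only send pods carrying a JobSet label to the webhook
    matchExpressions:
    - key: jobset.sigs.k8s.io/jobset-name  # assumed label key
      operator: Exists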

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

@kannon92 @ahg-g since kubebuilder markers don't support objectSelectors, the only way I can think of to add one to the manifest generated from the markers is to run a Python script that injects it as a build step. What do you think?

Actually, maybe there's a way to do it with kustomize?

@ahg-g (Contributor) commented Dec 21, 2023

Yes, I totally forgot about the objectSelector, which we used to add manually during testing. We absolutely need to add it to avoid making the JobSet operator a single point of failure for the whole cluster.

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

I asked in the kubebuilder Slack channel, and it sounds like there may be a way to inject the objectSelector using a kustomize JSON patch. I'm looking into it now.
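
For reference, one way this could work is a kustomize JSON patch applied over the generated webhook manifest, along the lines of the sketch below; the target name and label key are assumptions, not verified against the repo:

# kustomization.yaml fragment (hypothetical; names and label key are assumed)
patches:
- target:
    group: admissionregistration.k8s.io
    version: v1
    kind: MutatingWebhookConfiguration
    name: mutating-webhook-configuration  # assumed generated name
  patch: |-
    - op: add
      path: /webhooks/0/objectSelector
      value:
        matchExpressions:
        - key: jobset.sigs.k8s.io/jobset-name  # assumed label key
          operator: Exists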
