[JobSet v0.3.0 Bug] Webhook failing to start and somehow blocking regular Jobs not owned by JobSets from creating pods #361

Closed · danielvegamyhre opened this issue Dec 21, 2023 · 10 comments · Fixed by #362
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

@danielvegamyhre (Contributor)

This problem doesn't happen consistently; I manually tested JobSet v0.3.0 multiple times successfully. However, a user today is running into a concerning error. After installing v0.3.0, the JobSet controller manager deployment fails to start up, and then, somehow, when they submit a regular Job (not owned by a JobSet), the Job cannot create any pods due to an error calling the JobSet webhook (??).

Steps to reproduce:

  1. Deploy JobSet v0.3.0

The JobSet controller manager is not starting properly, so there are no endpoints for the webhook service:

kubectl describe rs -n jobset-system

...
Events:
  Type     Reason        Age                  From                   Message
  ----     ------        ----                 ----                   -------
  Warning  FailedCreate  8m57s (x47 over 8h)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s": no endpoints available for service "jobset-webhook-service"
  2. Create a standalone Job. Example Job + headless service spec being deployed by the user encountering this issue:
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    job-name: tpu-job
---
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-job
spec:
  backoffLimit: 0
  # Completions and parallelism should be the number of chips divided by 4.
  # (e.g. 4 for a v5litepod-16)
  completions: 4
  parallelism: 4
  completionMode: Indexed
  template:
    spec:
      subdomain: headless-svc
      restartPolicy: Never
      terminationGracePeriodSeconds: 300
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x1
      containers:
      - name: tpu-job
        image: python:3.10
        ports:
        - containerPort: 8471 # Default port using which TPU VMs communicate
        - containerPort: 8431 # Port to export TPU usage metrics, if supported
        securityContext:
          privileged: true
        command:
        - bash
        - -c
        - |
          printenv
          pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
          python -c 'import jax; print("JAX Devices:", jax.devices(), "Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
        resources:
          requests:
            google.com/tpu: 4
          limits:
            google.com/tpu: 4

Describing the Job after deploying it:

kubectl describe job tpu-job


Name:               tpu-job
Namespace:          default
Selector:           batch.kubernetes.io/controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
Labels:             batch.kubernetes.io/controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
                    batch.kubernetes.io/job-name=tpu-job
                    controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
                    job-name=tpu-job
Annotations:        <none>
Parallelism:        4
Completions:        4
Completion Mode:    Indexed
Start Time:         Thu, 21 Dec 2023 17:31:55 +0000
Pods Statuses:      0 Active (0 Ready) / 0 Succeeded / 0 Failed
Completed Indexes:  <none>
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
           batch.kubernetes.io/job-name=tpu-job
           controller-uid=d0adcb30-8b41-4122-9751-2bb502f845d9
           job-name=tpu-job
  Containers:
   tpu-job:
    Image:       python:3.10
    Ports:       8471/TCP, 8431/TCP
    Host Ports:  0/TCP, 0/TCP
    Command:
      bash
      -c
      printenv
      pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
      python -c 'import jax; print("JAX Devices:", jax.devices(), "Global device count:", jax.device_count(), "Local device count:", jax.local_device_count())'
      
    Limits:
      google.com/tpu:  4
    Requests:
      google.com/tpu:  4
    Environment:
      TPU_WORKER_HOSTNAMES:  tpu-job-0.headless-svc,tpu-job-1.headless-svc,tpu-job-2.headless-svc,tpu-job-3.headless-svc
      TPU_WORKER_ID:          (v1:metadata.annotations['batch.kubernetes.io/job-completion-index'])
    Mounts:                  <none>
  Volumes:                   <none>
Events:
  Type     Reason        Age                From            Message
  ----     ------        ----               ----            -------
  Warning  FailedCreate  30s (x6 over 61s)  job-controller  Error creating: Internal error occurred: failed calling webhook "mpod.kb.io": failed to call webhook: Post "https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s": no endpoints available for service "jobset-webhook-service"
$:~/go/src/sigs.k8s.io/jobset$ k get pods -n jobset-system
@danielvegamyhre (Contributor, Author)

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 21, 2023
@danielvegamyhre danielvegamyhre self-assigned this Dec 21, 2023
@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

I think I see the issue. The mutating pod webhook is selecting all pods, and has no constraints on them being owned by a JobSet. We need to update it to add that constraint.

For(&corev1.Pod{}).

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

Hmm the thing is, the pod webhook has logic which skips pods that aren't owned by a JobSet.

if _, isJobSetPod := pod.Annotations[jobset.JobSetNameKey]; !isJobSetPod {

It's just that when defining a webhook, you have to declare the object type it handles, which in this case is Pod.

So if the JobSet controller manager can't find a node to run on, but every pod being created in the cluster still has to go through that webhook, this can happen?

@danielvegamyhre (Contributor, Author)

I think we may be able to add filters to the webhook via NewWebhookManagedBy(), similar to what we do in the controllers, to avoid this issue. I'll look into it.

@kannon92 (Contributor)

This sounds like the issues @dejanzele and I have been seeing.

I think something is up with the cert-rotator we use with newer versions of kind. I think it's related to the flakiness in the e2e tests.

So sometimes we are not able to deploy the JobSet deployment, and I see failures in kube-controller-manager around the pod webhooks. As you stated, the deployment is stuck and you see failures calling the pod webhook.

In our e2e tests you can see errors like this:

2023-12-21T12:39:30.271766422Z stderr F I1221 12:39:30.271589       1 event.go:376] "Event occurred" object="jobset-system/jobset-controller-manager" fieldPath="" kind="Deployment" apiVersion="apps/v1" type="Normal" reason="ScalingReplicaSet" message="Scaled up replica set jobset-controller-manager-6ff5cc5557 to 1"
2023-12-21T12:39:31.328311043Z stderr F I1221 12:39:31.328106       1 event.go:376] "Event occurred" object="jobset-system/jobset-controller-manager-6ff5cc5557" fieldPath="" kind="ReplicaSet" apiVersion="apps/v1" type="Warning" reason="FailedCreate" message="Error creating: Internal error occurred: failed calling webhook \"mpod.kb.io\": failed to call webhook: Post \"https://jobset-webhook-service.jobset-system.svc:443/mutate--v1-pod?timeout=10s\": dial tcp 10.96.98.174:443: connect: connection refused"

See https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_jobset/359/pull-jobset-test-e2e-main-1-29/1737813742327959552/artifacts/kind-control-plane/pods/kube-system_kube-controller-manager-kind-control-plane_48fd15112471393d2f455b3c36475666/kube-controller-manager/0.log

@danielvegamyhre (Contributor, Author)

The pod webhook selects all pods; the webhook definition itself has no constraint that they be owned by a JobSet. We only filter out pods not owned by a JobSet in the Go code, at the admission stage of the webhook. So if the webhook is installed but its server is unable to run anywhere, it will block all pods in the cluster from being created.
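
For context, the generated webhook configuration looks roughly like the sketch below. The webhook name, service, path, and timeout come from the error messages above; the configuration name, failurePolicy, rules, and other fields are assumptions based on typical kubebuilder output rather than copied from the repo. The point is that there is no objectSelector, so with failurePolicy: Fail every pod CREATE in the cluster fails while the webhook service has no endpoints:

# Hypothetical sketch of the generated MutatingWebhookConfiguration (not copied from the repo).
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: jobset-mutating-webhook-configuration  # assumed name
webhooks:
- name: mpod.kb.io
  admissionReviewVersions: ["v1"]
  clientConfig:
    service:
      name: jobset-webhook-service
      namespace: jobset-system
      path: /mutate--v1-pod
      port: 443
  failurePolicy: Fail      # assumed; with Fail, an unreachable webhook blocks admission
  sideEffects: None
  timeoutSeconds: 10
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"] # assumed; the key point is that the rule matches pods cluster-wide
    resources: ["pods"]
  # no objectSelector, so every pod in the cluster is sent to this webhook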

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

We need to be able to add an objectSelector to the webhook so it only selects pods with certain labels (indicating they are part of a JobSet), but kubebuilder webhook markers do not support this: https://book.kubebuilder.io/reference/markers/webhook
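
For illustration, the field we want in the generated MutatingWebhookConfiguration would look something like this sketch (the label key is an assumption; it would be whatever label JobSet stamps onto the pods it creates):

webhooks:
- name: mpod.kb.io
  objectSelector:          # only send pods carrying a JobSet label to the webhook
    matchExpressions:
    - key: jobset.sigs.k8s.io/jobset-name  # assumed label key
      operator: Exists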

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

@kannon92 @ahg-g since kubebuilder markers don't support objectSelectors, the only way I can think of to add one to the manifest generated from the markers is to run a Python script that injects it as a build step. What do you think?

Actually, maybe there's a way to do it with kustomize?

@ahg-g (Contributor) commented Dec 21, 2023

Yes, I totally forgot about the objectSelector, which we used to add manually during testing. We absolutely need to add it to avoid making the JobSet operator a single point of failure for the whole cluster.

@danielvegamyhre (Contributor, Author) commented Dec 21, 2023

I asked in the kubebuilder Slack channel, and it sounds like there may be a way to inject the objectSelector using a kustomize JSON patch. I'm looking into it now.
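
For reference, one way this could work is a kustomize JSON patch applied over the generated webhook manifest, along the lines of the sketch below; the target name and label key are assumptions, not verified against the repo:

# kustomization.yaml fragment (hypothetical; names and label key are assumed)
patches:
- target:
    group: admissionregistration.k8s.io
    version: v1
    kind: MutatingWebhookConfiguration
    name: mutating-webhook-configuration  # assumed generated name
  patch: |-
    - op: add
      path: /webhooks/0/objectSelector
      value:
        matchExpressions:
        - key: jobset.sigs.k8s.io/jobset-name  # assumed label key
          operator: Exists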
