Job validation bug occurs in JobSet but not standalone Job: podFailurePolicy's "onPodConditions" field not treated as optional #377

Open
danielvegamyhre opened this issue Jan 17, 2024 · 5 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@danielvegamyhre
Contributor

danielvegamyhre commented Jan 17, 2024

A while ago we fixed a bug that caused a validation failure when the podFailurePolicy onPodConditions field is unset, even though it is an optional field: kubernetes/kubernetes#120204. That change was cherry-picked into 1.27, and my understanding is that using 1.27.7+ should give me the fix.

Despite using 1.27.9 for my GKE cluster master version, and v0.28.5 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:

~/go/src/sigs.k8s.io/jobset$ k apply -f examples/simple/configurable-failure-policy.yaml 
The JobSet "configurable-failure-policy" is invalid: 
* spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onPodConditions: Invalid value: "null": spec.replicatedJobs[0].template.spec.podFailurePolicy.rules[0].onPodConditions in body must be of type array: "null"
* spec.replicatedJobs[1].template.spec.podFailurePolicy.rules[0].onPodConditions: Invalid value: "null": spec.replicatedJobs[1].template.spec.podFailurePolicy.rules[0].onPodConditions in body must be of type array: "null"
* <nil>: Invalid value: "null": some validation rules were not checked because the object was invalid; correct the existing errors to complete validation
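
For context, a rough sketch (not the actual k8s or JobSet types; the type names below are made up for illustration): the 'must be of type array: "null"' errors above are what a nil Go slice produces when it is serialized without omitempty, which is presumably what happens when the object round-trips through k8s Go packages that predate the upstream fix:

package main

import (
	"encoding/json"
	"fmt"
)

// Stand-in for a rule type whose optional slice field lacks `omitempty` (the pre-fix shape).
type ruleBeforeFix struct {
	OnPodConditions []string `json:"onPodConditions"`
}

// Stand-in for the fixed shape, where the tag includes `omitempty`.
type ruleAfterFix struct {
	OnPodConditions []string `json:"onPodConditions,omitempty"`
}

func main() {
	unset, _ := json.Marshal(ruleBeforeFix{})                            // field left unset (nil slice)
	empty, _ := json.Marshal(ruleBeforeFix{OnPodConditions: []string{}}) // explicit empty list
	fixed, _ := json.Marshal(ruleAfterFix{})                             // field left unset, fixed tag

	fmt.Println(string(unset)) // {"onPodConditions":null} -> rejected: must be of type array
	fmt.Println(string(empty)) // {"onPodConditions":[]}   -> matches the empty-list workaround in the yaml below
	fmt.Println(string(fixed)) // {}                       -> field omitted entirely, passes validation
}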

Here is the JobSet yaml:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: configurable-failure-policy
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool # 1:1 job replica to node pool assignment
spec:
  failurePolicy:
    rules:
    - action: FailJobSet
      onJobFailureReasons: 
      - PodFailurePolicy
    maxRestarts: 3
  replicatedJobs:
  - name: buggy-job 
    replicas: 1 # set to number of node pools
    template:
      spec:
        parallelism: 1 # number of nodes per pool
        completions: 1 # number of nodes per pool
        backoffLimit: 0
        # fail job immediately if job was not killed by SIGTERM (graceful node shutdown for maintenance events)
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: NotIn
              values: [143] # SIGTERM = exit code 143
            onPodConditions: [] # Note this field must be specified as an empty list, even if unused
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: docker.io/library/bash:5
              command: ["bash"]
              args:
              - -c
              - echo "Hello world! I'm going to exit with exit code 1 to simulate a software bug." && sleep 20 && exit 1
  - name: normal-job
    replicas: 1 # set to number of node pools
    template:
      spec:
        parallelism: 1 # number of nodes per pool
        completions: 1 # number of nodes per pool
        backoffLimit: 0
        # fail job immediately if job was not killed by SIGTERM (graceful node shutdown for maintenance events)
        podFailurePolicy:
          rules:
          - action: FailJob
            onExitCodes:
              containerName: main
              operator: NotIn
              values: [143] # SIGTERM = exit code 143
            onPodConditions: [] # Note this field must be specified as an empty list, even if unused
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: main
              image: docker.io/library/bash:5
              command: ["bash"]
              args:
              - -c
              - echo "Hello world! I'm going to exit with SIGTERM (exit code) to simulate host maintenace." && sleep 5 && exit 143

However, before that change, validation failed even when I set onPodConditions: []. Now, with onPodConditions: [] set explicitly, validation passes and things work as expected.

To investigate this further, I tried creating a standalone Job with the same podFailurePolicy, which worked:

apiVersion: batch/v1
kind: Job
metadata:
  name: pi
spec:
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: pi
        operator: NotIn
        values: [143] # SIGTERM = exit code 143
  template:
    spec:
      containers:
      - name: pi
        image: perl:5.34.0
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4
 

This leads me to believe the issue is specific to JobSet / the versions of the k8s Go packages we are using, so I am opening this bug.

/kind bug

@danielvegamyhre
Contributor Author

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 17, 2024
@kannon92
Contributor

I don't think this is correct.

Despite using 1.27.9 for my GKE cluster master version, and v0.27.7 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:

The go.mod/go.sum seems to point to 0.28.4 rather than this one.

@danielvegamyhre
Contributor Author

I don't think this is correct.

Despite using 1.27.9 for my GKE cluster master version, and v0.27.7 for k8s Go packages in JobSet, I get the following error when onPodConditions is unset:

The go.mod/go.sum seems to point to 0.28.4 rather than this one.

You're right, I was looking at #301 when writing this. I'll update the description accordingly.
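
For reference, the k8s.io/api version the module actually resolves can be double-checked from the JobSet repo root with:

go list -m k8s.io/api

which should print the version pinned by go.mod/go.sum (0.28.4 per the comment above).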

@kannon92
Contributor

@mimowo we still see that this is a problem. It looks fixed when using normal Jobs, but it still seems present for JobSet.

@danielvegamyhre danielvegamyhre added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Feb 16, 2024
@mimowo
Contributor

mimowo commented Jul 31, 2024

Bump, this is being addressed by bumping to the fixed k8s API in kubernetes/kubernetes#126046 (1.29.8 and 1.30.4 will include the fix).
