Add support for 1:1 job per topology assignment #36

Merged
merged 12 commits into from
Apr 18, 2023

@danielvegamyhre danielvegamyhre commented Apr 17, 2023

Fixes #27

Adds an Exclusive struct to the API with a TopologyKey field. If it is specified, the JobSet controller appends pod affinity and anti-affinity terms to each job's pod template spec. Together these ensure a 1:1 assignment of jobs to topology domains: all pods of a job land in the same topology domain, and no two jobs share one.
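
For readers who want a picture of the mechanism, here is a minimal sketch of how such affinity terms could be built. It is not the code from this PR: the types are simplified local stand-ins for the Kubernetes core/v1 affinity types, and the label key `example.com/job-key` and the function name `exclusiveAffinity` are hypothetical. The idea is that the affinity term pulls a pod toward pods of its own job, while the anti-affinity term pushes it away from pods of any other job, both scoped by the same topology key.

```go
package main

import "fmt"

// Requirement, AffinityTerm and Affinity are simplified stand-ins for the
// Kubernetes core/v1 affinity types, kept local so the sketch has no dependencies.
type Requirement struct {
	Key      string
	Operator string // "In", "NotIn" or "Exists"
	Values   []string
}

type AffinityTerm struct {
	MatchExpressions []Requirement
	TopologyKey      string
}

type Affinity struct {
	PodAffinity     []AffinityTerm // treated as required at scheduling time
	PodAntiAffinity []AffinityTerm // treated as required at scheduling time
}

// Exclusive mirrors the API struct described in this PR.
type Exclusive struct {
	TopologyKey string
}

// jobKeyLabel is a hypothetical pod label that identifies the owning job.
const jobKeyLabel = "example.com/job-key"

// exclusiveAffinity returns the terms for one job, or nil if exclusive
// placement was not requested.
func exclusiveAffinity(ex *Exclusive, jobKey string) *Affinity {
	if ex == nil || ex.TopologyKey == "" {
		return nil
	}
	return &Affinity{
		// Co-locate with pods that carry this job's key.
		PodAffinity: []AffinityTerm{{
			MatchExpressions: []Requirement{
				{Key: jobKeyLabel, Operator: "In", Values: []string{jobKey}},
			},
			TopologyKey: ex.TopologyKey,
		}},
		// Stay away from pods that carry a job key other than this one.
		PodAntiAffinity: []AffinityTerm{{
			MatchExpressions: []Requirement{
				{Key: jobKeyLabel, Operator: "Exists"},
				{Key: jobKeyLabel, Operator: "NotIn", Values: []string{jobKey}},
			},
			TopologyKey: ex.TopologyKey,
		}},
	}
}

func main() {
	ex := &Exclusive{TopologyKey: "cloud.google.com/gke-nodepool"}
	a := exclusiveAffinity(ex, "jobpertopology-worker-0")
	fmt.Printf("affinity:      %+v\n", a.PodAffinity)
	fmt.Printf("anti-affinity: %+v\n", a.PodAntiAffinity)
}
```

The `Exists` requirement in the anti-affinity term keeps pods that have no job key at all (for example, system pods) from blocking placement.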

Note: I also had to update the make install command in the Makefile to use kubectl create instead of kubectl apply as a workaround for this issue.

I ran a manual test on a cluster with 3 node pools, using the topology key cloud.google.com/gke-nodepool to place 1 job per node pool, and it worked.

Test JobSet with 3 replicated jobs, 1 replica each:

apiVersion: batch.x-k8s.io/v1alpha1
kind: JobSet
metadata:
  name: jobpertopology
  labels:
    app.kubernetes.io/name: jobset
    app.kubernetes.io/instance: jobset-example
    app.kubernetes.io/part-of: project
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: project
spec:
  jobs:
  - name: worker0
    network: 
      enableDNSHostnames: true
    exclusive:
      topologyKey: cloud.google.com/gke-nodepool
    template:
      metadata:
        name: "worker0"
      spec:
        completionMode: Indexed
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: busybox-short-sleep
              image: busybox:latest
              command:
              - sleep
              - "1000" 
  - name: worker1
    network: 
      enableDNSHostnames: true
    exclusive:
      topologyKey: cloud.google.com/gke-nodepool
    template:
      metadata:
        name: "worker1"
      spec:
        completionMode: Indexed
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: busybox-short-sleep
              image: busybox:latest
              command:
              - sleep
              - "1000" 
  - name: worker2
    network: 
      enableDNSHostnames: true
    exclusive:
      topologyKey: cloud.google.com/gke-nodepool
    template:
      metadata:
        name: "worker2"
      spec:
        completionMode: Indexed
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: busybox-short-sleep
              image: busybox:latest
              command:
              - sleep
              - "1000" 

Pod placement:

$ k get pods -o wide
NAME                               READY   STATUS    RESTARTS   AGE   IP         NODE                                     NOMINATED NODE   READINESS GATES
jobpertopology-worker0-0-0-6tg7d   1/1     Running   0          7s    10.8.7.3   gke-jobset4-np2-782cc200-vxq8            <none>           <none>
jobpertopology-worker0-0-1-262vf   1/1     Running   0          7s    10.8.8.3   gke-jobset4-np2-782cc200-946b            <none>           <none>
jobpertopology-worker1-0-0-lc2q6   1/1     Running   0          7s    10.8.5.4   gke-jobset4-np1-004ddd96-58v7            <none>           <none>
jobpertopology-worker1-0-1-4wgcz   1/1     Running   0          7s    10.8.3.4   gke-jobset4-np1-004ddd96-cvzx            <none>           <none>
jobpertopology-worker2-0-0-849c6   1/1     Running   0          7s    10.8.1.9   gke-jobset4-default-pool-238f9ca7-8s4k   <none>           <none>
jobpertopology-worker2-0-1-mdlqv   1/1     Running   0          7s    10.8.2.6   gke-jobset4-default-pool-238f9ca7-hc67   <none>           <none>

Test JobSet with 1 replicated job with 3 replicas:

apiVersion: batch.x-k8s.io/v1alpha1
kind: JobSet
metadata:
  name: jobpertopology
  labels:
    app.kubernetes.io/name: jobset
    app.kubernetes.io/instance: jobset-example
    app.kubernetes.io/part-of: project
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: project
spec:
  jobs:
  - name: worker
    network: 
      enableDNSHostnames: true
    exclusive:
      topologyKey: cloud.google.com/gke-nodepool
    replicas: 3
    template:
      metadata:
        name: "worker"
      spec:
        completionMode: Indexed
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: busybox-short-sleep
              image: busybox:latest
              command:
              - sleep
              - "1000"

Pod placement:

$ k get pods -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP          NODE                                     NOMINATED NODE   READINESS GATES
jobpertopology-worker-0-0-tdbfp   1/1     Running   0          11s   10.8.8.4    gke-jobset4-np2-782cc200-946b            <none>           <none>
jobpertopology-worker-0-1-rzrnb   1/1     Running   0          11s   10.8.7.4    gke-jobset4-np2-782cc200-vxq8            <none>           <none>
jobpertopology-worker-1-0-pkqgt   1/1     Running   0          11s   10.8.4.4    gke-jobset4-np1-004ddd96-c0b6            <none>           <none>
jobpertopology-worker-1-1-fdgtn   1/1     Running   0          11s   10.8.3.5    gke-jobset4-np1-004ddd96-cvzx            <none>           <none>
jobpertopology-worker-2-0-q7wbr   1/1     Running   0          11s   10.8.1.10   gke-jobset4-default-pool-238f9ca7-8s4k   <none>           <none>
jobpertopology-worker-2-1-kp4n7   1/1     Running   0          11s   10.8.2.7    gke-jobset4-default-pool-238f9ca7-hc67   <none>           <none>
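
The placement above can also be checked mechanically. The following is a small sketch, not part of this PR, of the two properties being tested: every pod of a job sits in one topology domain, and no domain hosts pods from two jobs. The `Placement` type and `checkExclusive` function are hypothetical, and the node pool values (np2, np1, default-pool) are inferred from the node names in the output rather than read from the cloud.google.com/gke-nodepool label.

```go
package main

import "fmt"

// Placement records which job a pod belongs to and the value of the
// topology key on the node it was scheduled to.
type Placement struct {
	Job      string
	Topology string
}

// checkExclusive returns an error if a job spans more than one topology
// domain or if a topology domain hosts pods from more than one job.
func checkExclusive(pods []Placement) error {
	jobToTopology := map[string]string{}
	topologyToJob := map[string]string{}
	for _, p := range pods {
		if t, seen := jobToTopology[p.Job]; seen && t != p.Topology {
			return fmt.Errorf("job %s spans topologies %s and %s", p.Job, t, p.Topology)
		}
		if j, seen := topologyToJob[p.Topology]; seen && j != p.Job {
			return fmt.Errorf("topology %s hosts jobs %s and %s", p.Topology, j, p.Job)
		}
		jobToTopology[p.Job] = p.Topology
		topologyToJob[p.Topology] = p.Job
	}
	return nil
}

func main() {
	// One entry per pod from the listing above; node pools inferred from node names.
	pods := []Placement{
		{Job: "jobpertopology-worker-0", Topology: "np2"},
		{Job: "jobpertopology-worker-0", Topology: "np2"},
		{Job: "jobpertopology-worker-1", Topology: "np1"},
		{Job: "jobpertopology-worker-1", Topology: "np1"},
		{Job: "jobpertopology-worker-2", Topology: "default-pool"},
		{Job: "jobpertopology-worker-2", Topology: "default-pool"},
	}
	if err := checkExclusive(pods); err != nil {
		fmt.Println("violation:", err)
		return
	}
	fmt.Println("placement is 1:1")
}
```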

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 17, 2023
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g April 17, 2023 20:15
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 17, 2023
danielvegamyhre and others added 3 commits April 17, 2023 18:46
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>

@ahg-g ahg-g left a comment

few nits

danielvegamyhre and others added 6 commits April 17, 2023 21:12
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
Co-authored-by: Abdullah Gharaibeh <40361897+ahg-g@users.noreply.github.com>
@ahg-g

ahg-g commented Apr 18, 2023

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 18, 2023
@ahg-g

ahg-g commented Apr 18, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 18, 2023
@k8s-ci-robot k8s-ci-robot merged commit 691b0ad into kubernetes-sigs:main Apr 18, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Development

Successfully merging this pull request may close these issues.

1:1 job to topology assignment
3 participants