Identify quota changes needed for gpu jobs, create pool of gpu projects #1095

Closed
6 tasks done
spiffxp opened this issue Jul 31, 2020 · 16 comments
Labels
area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters sig/release Categorizes an issue or PR as relevant to SIG Release. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@spiffxp
Member

spiffxp commented Jul 31, 2020

This is like #851, but for GPU jobs.

The default quotas for an e2e project (e.g. k8s-infra-e2e-gce-project) may be insufficient to run ci-kubernetes-e2e-gce-device-plugin-gpu.

Currently this job runs in the google.com k8s-prow-builds cluster, using a project from that cluster's boskos gpu-project pool.

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

/assign

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

$ for prj in k8s-gke-gpu-boskos-02 k8s-infra-e2e-gpu-project; do
  gcloud compute project-info describe --project=$prj > $prj.compute-project-info.yaml
done
$ diff -yW80 {k8s-infra-e2e-gpu-project,k8s-gke-gpu-boskos-02}.compute-project-info.yaml | grep -A3 "|.*limit"
- limit: 100.0			      |	- limit: 200.0
  metric: SECURITY_POLICY_RULES		  metric: SECURITY_POLICY_RULES
  usage: 0.0				  usage: 0.0
- limit: 45.0				- limit: 45.0
--
- limit: 300.0			      |	- limit: 1000.0
  metric: NETWORK_ENDPOINT_GROUPS	  metric: NETWORK_ENDPOINT_GROUPS
  usage: 0.0				  usage: 0.0
- limit: 6.0				- limit: 6.0
--
- limit: 15.0			      |	- limit: 50.0
  metric: EXTERNAL_VPN_GATEWAYS		  metric: EXTERNAL_VPN_GATEWAYS
  usage: 0.0				  usage: 0.0
- limit: 1.0				- limit: 1.0
--
- limit: 1024.0			      |	- limit: 128.0
  metric: STATIC_BYOIP_ADDRESSES	  metric: STATIC_BYOIP_ADDRESSES
  usage: 0.0				  usage: 0.0

... none of these are GPU-related; I probably need to look at something zone- or region-specific.
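The `diff | grep -A3` output above also pulls in unchanged neighbouring entries. As a rough alternative, here is a minimal sketch that prints only the metrics whose limits differ. It assumes each quota entry is laid out exactly as in the dumps above (`- limit:`, then `metric:`, then `usage:`); `quota_diff` is a hypothetical helper name, not part of gcloud.

```shell
# Sketch: print "METRIC LEFT_LIMIT RIGHT_LIMIT" for every metric whose
# limit differs between two `gcloud ... describe` YAML dumps.
# Assumes each quota entry lists `- limit:` before `metric:`.
quota_diff() {
  awk '
    # Remember the most recent limit seen.
    $1 == "-" && $2 == "limit:" { limit = $3 }
    $1 == "metric:" {
      if (FNR == NR) {
        # Still reading the first file: record its limit for this metric.
        left[$2] = limit
      } else if (($2 in left) && left[$2] != limit) {
        # Second file: report metrics whose limit changed.
        print $2, left[$2], limit
      }
    }
  ' "$1" "$2"
}
```

It could be called as `quota_diff k8s-infra-e2e-gpu-project.compute-project-info.yaml k8s-gke-gpu-boskos-02.compute-project-info.yaml`.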

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

$ diff -yW80 {k8s-infra-e2e-gpu-project,k8s-gke-gpu-boskos-02}.us-west1-describe.yaml | grep -A3 "|.*limit"
- limit: 1.0			      |	- limit: 8.0
  metric: NVIDIA_P100_GPUS		  metric: NVIDIA_P100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0				- limit: 0.0
--
- limit: 1.0			      |	- limit: 0.0
  metric: PREEMPTIBLE_NVIDIA_K80_GPUS	  metric: PREEMPTIBLE_NVIDIA_K80_GPUS
  usage: 0.0				  usage: 0.0
- limit: 1.0			      |	- limit: 0.0
  metric: PREEMPTIBLE_NVIDIA_P100_GPU	  metric: PREEMPTIBLE_NVIDIA_P100_GPU
  usage: 0.0				  usage: 0.0
- limit: 1.0				- limit: 1.0
--
- limit: 50.0			      |	- limit: 500.0
  metric: IN_USE_SNAPSHOT_SCHEDULES	  metric: IN_USE_SNAPSHOT_SCHEDULES
  usage: 0.0				  usage: 0.0
- limit: 4.0				- limit: 4.0
--
- limit: 50.0			      |	- limit: 20.0
  metric: IN_USE_BACKUP_SCHEDULES	  metric: IN_USE_BACKUP_SCHEDULES
  usage: 0.0				  usage: 0.0
- limit: 10.0				- limit: 10.0
--
- limit: 0.0			      |	- limit: 16.0
  metric: COMMITTED_NVIDIA_K80_GPUS	  metric: COMMITTED_NVIDIA_K80_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 1.0
  metric: COMMITTED_NVIDIA_P100_GPUS	  metric: COMMITTED_NVIDIA_P100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 1.0
  metric: COMMITTED_NVIDIA_P4_GPUS	  metric: COMMITTED_NVIDIA_P4_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 8.0
  metric: COMMITTED_NVIDIA_V100_GPUS	  metric: COMMITTED_NVIDIA_V100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 8.0
  metric: COMMITTED_NVIDIA_T4_GPUS	  metric: COMMITTED_NVIDIA_T4_GPUS
  usage: 0.0				  usage: 0.0
- limit: 24.0				- limit: 24.0
--
- limit: 0.0			      |	- limit: 24.0
  metric: COMMITTED_N2_CPUS		  metric: COMMITTED_N2_CPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 24.0
  metric: COMMITTED_C2_CPUS		  metric: COMMITTED_C2_CPUS
  usage: 0.0				  usage: 0.0
- limit: 100.0			      |	- limit: 2000.0
  metric: RESERVATIONS			  metric: RESERVATIONS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 150.0
  metric: COMMITTED_LICENSES		  metric: COMMITTED_LICENSES
  usage: 0.0				  usage: 0.0
- limit: 24.0			      |	- limit: 0.0
  metric: N2D_CPUS			  metric: N2D_CPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 24.0
  metric: COMMITTED_N2D_CPUS		  metric: COMMITTED_N2D_CPUS
  usage: 0.0				  usage: 0.0
- limit: 1024.0			      |	- limit: 128.0
  metric: STATIC_BYOIP_ADDRESSES	  metric: STATIC_BYOIP_ADDRESSES
  usage: 0.0				  usage: 0.0
- limit: 3.0			      |	- limit: 10.0
  metric: AFFINITY_GROUPS		  metric: AFFINITY_GROUPS
  usage: 0.0				  usage: 0.0
- limit: 1.0				- limit: 1.0
--
- limit: 16.0			      |	- limit: 64.0
  metric: PREEMPTIBLE_NVIDIA_A100_GPU	  metric: PREEMPTIBLE_NVIDIA_A100_GPU
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 16.0
  metric: COMMITTED_NVIDIA_A100_GPUS	  metric: COMMITTED_NVIDIA_A100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 12.0				- limit: 12.0
--
- limit: 0.0			      |	- limit: 192.0
  metric: COMMITTED_A2_CPUS		  metric: COMMITTED_A2_CPUS
  usage: 0.0				  usage: 0.0
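The thread does not show how the `*.us-west1-describe.yaml` files were produced. A plausible sketch, assuming they came from `gcloud compute regions describe`, is below. The `dump_region_quotas` helper and the overridable `GCLOUD` variable are hypothetical additions so the loop can be dry-run.

```shell
# Sketch: dump region-level quotas for each project into
# <project>.<region>-describe.yaml (the naming used by the diff above).
# ASSUMPTION: the original files came from `gcloud compute regions describe`.
# GCLOUD can be overridden (e.g. GCLOUD=echo) to dry-run the loop.
GCLOUD="${GCLOUD:-gcloud}"

dump_region_quotas() {
  # $1: region, remaining args: project IDs
  region="$1"; shift
  for prj in "$@"; do
    "$GCLOUD" compute regions describe "$region" --project="$prj" \
      > "$prj.$region-describe.yaml"
  done
}
```

For the two projects compared here, this would be `dump_region_quotas us-west1 k8s-gke-gpu-boskos-02 k8s-infra-e2e-gpu-project`.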

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

Jobs that use this project type:

  • ci-cri-containerd-e2e-gce-device-plugin-gpu
  • ci-kubernetes-e2e-gce-device-plugin-gpu
  • ci-kubernetes-e2e-gce-device-plugin-gpu-beta
  • ci-kubernetes-e2e-gce-device-plugin-gpu-stable1
  • ci-kubernetes-e2e-gce-device-plugin-gpu-stable2
  • ci-kubernetes-e2e-gce-device-plugin-gpu-stable3
  • ci-kubernetes-e2e-gce-gpu-beta-stable1-cluster-downgrade
  • ci-kubernetes-e2e-gce-gpu-master-stable1-cluster-downgrade
  • ci-kubernetes-e2e-gce-gpu-stable1-beta-cluster-upgrade
  • ci-kubernetes-e2e-gce-gpu-stable1-beta-master-upgrade
  • ci-kubernetes-e2e-gce-gpu-stable1-master-cluster-upgrade
  • ci-kubernetes-e2e-gce-gpu-stable1-master-master-upgrade
  • ci-kubernetes-e2e-gce-gpu-stable2-stable1-cluster-upgrade
  • ci-kubernetes-e2e-gce-gpu-stable2-stable1-master-upgrade

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

Will use the canary job in kubernetes/test-infra#18664 to verify whether the quota works.

I'm also noticing this fun preset appears to be involved:

presets:
- labels:
    preset-ci-gce-device-plugin-gpu: "true"
  env:
  - name: KUBE_GCE_NODE_IMAGE
    value: gke-1134-gke-rc5-cos-69-10895-138-0-v190320-pre-nvda-gpu
  - name: KUBE_GCE_NODE_PROJECT
    value: gke-node-images
  - name: NODE_ACCELERATORS
    value: type=nvidia-tesla-k80,count=2

Not sure how it's used, but that may present other complications
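To relate the preset to a quota metric, it may help to pull the accelerator type and per-node count out of the `NODE_ACCELERATORS` value. This is a minimal sketch that only parses the string. `accelerator_field` is a hypothetical helper, and how the e2e scripts actually consume the variable is not covered here.

```shell
# Sketch: extract one key from a NODE_ACCELERATORS value such as
# "type=nvidia-tesla-k80,count=2".
accelerator_field() {
  # $1: NODE_ACCELERATORS value, $2: key to look up ("type" or "count")
  printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '$1 == key { print $2 }'
}
```

For example, `accelerator_field "type=nvidia-tesla-k80,count=2" count` yields the number of GPUs requested per node, which gives a lower bound on the K80 quota a project would need.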

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

Based on the preset, I'm going to assume this is the quota we should pay attention to. I'm less clear about the rest:

- limit: 0.0			      |	- limit: 16.0
  metric: COMMITTED_NVIDIA_K80_GPUS	  metric: COMMITTED_NVIDIA_K80_GPUS
  usage: 0.0				  usage: 0.0

@spiffxp
Member Author

spiffxp commented Aug 4, 2020

I went ahead and submitted a request for Committed NVIDIA K80 GPUs, us-west1: 0->2
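Once a request like this is approved, the new limit can be checked against a fresh regional dump. The following is a sketch under the same layout assumption as the dumps above (`- limit:` precedes `metric:`). `quota_limit` and `has_quota` are hypothetical helper names.

```shell
# Sketch: print the limit for one metric from a regional quota YAML dump.
quota_limit() {
  # $1: YAML file, $2: metric name (e.g. COMMITTED_NVIDIA_K80_GPUS)
  awk -v metric="$2" '
    $1 == "-" && $2 == "limit:" { limit = $3 }
    $1 == "metric:" && $2 == metric { print limit; exit }
  ' "$1"
}

# Succeed (exit 0) if the metric's limit is at least the required amount.
has_quota() {
  # $1: YAML file, $2: metric name, $3: required amount
  limit=$(quota_limit "$1" "$2")
  [ -n "$limit" ] && awk -v l="$limit" -v n="$3" 'BEGIN { exit !(l + 0 >= n + 0) }'
}
```

A check for the request above might read `has_quota k8s-infra-e2e-gpu-project.us-west1-describe.yaml COMMITTED_NVIDIA_K80_GPUS 2`.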

@spiffxp
Member Author

spiffxp commented Aug 6, 2020

That was enough to get https://testgrid.k8s.io/sig-testing-canaries#gce-device-plugin-gpu to pass

The existing gpu-project pool has 15 projects, and usage peaks at 5 projects.

@spiffxp
Member Author

spiffxp commented Aug 6, 2020

Ah! Fun fact: pull-kubernetes-e2e-gce-device-plugin-gpu is pinned to a single project, k8s-jkns-pr-gce-gpus. So 15 projects may not quite be enough.

@spiffxp
Member Author

spiffxp commented Aug 7, 2020

kubernetes/test-infra#18728 demoted pull-kubernetes-e2e-gce-device-plugin-gpu from merge-blocking; it is now manually triggered with max_concurrency 5.

@spiffxp
Member Author

spiffxp commented Aug 7, 2020

Now filling out quota requests for 10 projects...
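The thread does not say how the figure of 10 projects was chosen. One possible reading, offered purely as an assumption, is that it covers the observed peak of 5 projects in the existing pool plus the max_concurrency of 5 now set on the pull job. A sketch of that arithmetic, with `pool_size` as a hypothetical helper:

```shell
# Sketch: size a pool as peak periodic usage plus presubmit concurrency.
# ASSUMPTION: this is one reading of the numbers in the thread, not a
# stated rule.
pool_size() {
  # $1: peak projects in use by periodic jobs, $2: presubmit max_concurrency
  echo $(( $1 + $2 ))
}

pool_size 5 5
```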

@spiffxp
Member Author

spiffxp commented Aug 7, 2020

... which were small enough to be automatically approved

@spiffxp
Member Author

spiffxp commented Aug 7, 2020

Opened #1125 to add the projects to k8s-infra-prow-build's boskos as a new gpu-project pool

@spiffxp
Member Author

spiffxp commented Aug 7, 2020

@spiffxp
Member Author

spiffxp commented Aug 8, 2020

/close
Calling this done

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
Calling this done

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
