Identify quota changes needed for gpu jobs, create pool of gpu projects #1095

spiffxp · 2020-07-31T21:00:45Z

This is like #851 but for gpu

The default quotas for an e2e project (e.g. k8s-infra-e2e-gce-project) MAY be insufficient to run ci-kubernetes-e2e-gce-device-plugin-gpu

Currently this job runs in the google.com k8s-prow-builds cluster, using a project from that cluster's boskos gpu-project pool

Identify whether ci-kubernetes-e2e-gce-device-plugin-gpu can run against a stock e2e project: it couldn't; it needed raised quota
Identify what quotas are set that make gpu-project pool projects differ from a default e2e project: raised committed K80 GPUs in us-west1 from 0 to 2
Identify what jobs this will allow us to migrate: the jobs listed in #1095 (comment), plus pull-kubernetes-e2e-gce-device-plugin-gpu
Identify pool size: went with 10 (reasoning in Provision 10 gpu GCP projects #1118)
Provision pool in kubernetes.io org (done via: Provision 10 gpu GCP projects #1118)
Add pool to k8s-infra-prow-build's boskos (done via: Add gpu-project pool to prow-build boskos #1125)

spiffxp · 2020-08-04T22:56:56Z

/assign

spiffxp · 2020-08-04T22:57:00Z

$ for prj in k8s-gke-gpu-boskos-02 k8s-infra-e2e-gpu-project; do
  gcloud compute project-info describe --project=$prj > $prj.compute-project-info.yaml
done
$ diff -yW80 {k8s-infra-e2e-gpu-project,k8s-gke-gpu-boskos-02}.compute-project-info.yaml | grep -A3 "|.*limit"
- limit: 100.0			      |	- limit: 200.0
  metric: SECURITY_POLICY_RULES		  metric: SECURITY_POLICY_RULES
  usage: 0.0				  usage: 0.0
- limit: 45.0				- limit: 45.0
--
- limit: 300.0			      |	- limit: 1000.0
  metric: NETWORK_ENDPOINT_GROUPS	  metric: NETWORK_ENDPOINT_GROUPS
  usage: 0.0				  usage: 0.0
- limit: 6.0				- limit: 6.0
--
- limit: 15.0			      |	- limit: 50.0
  metric: EXTERNAL_VPN_GATEWAYS		  metric: EXTERNAL_VPN_GATEWAYS
  usage: 0.0				  usage: 0.0
- limit: 1.0				- limit: 1.0
--
- limit: 1024.0			      |	- limit: 128.0
  metric: STATIC_BYOIP_ADDRESSES	  metric: STATIC_BYOIP_ADDRESSES
  usage: 0.0				  usage: 0.0

... none of these are GPU-related; probably need to look for something zone/region-specific
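The side-by-side diff above is noisy, so a narrower comparison may be easier to read. The following is a minimal sketch, assuming each quota entry in the saved describe output is laid out as `- limit:`, `metric:`, `usage:` in that order (as it appears in the output above). `quota_limits` and `quota_diff` are hypothetical helper names, not part of gcloud.

```shell
# Hypothetical helpers (not part of gcloud): compare quota limits between two
# files saved from `gcloud compute project-info describe`.
# Assumes each quota entry is laid out as "- limit:", "metric:", "usage:".

# Print "METRIC LIMIT" pairs, sorted by metric name.
quota_limits() {
  awk '$1 == "-" && $2 == "limit:" { limit = $3 }
       $1 == "metric:"             { print $2, limit }' "$1" | LC_ALL=C sort
}

# Print only the metrics whose limits differ between the two files.
# Metrics present in only one of the files are silently skipped by join.
quota_diff() {
  left=$(mktemp) && right=$(mktemp) || return 1
  quota_limits "$1" > "$left"
  quota_limits "$2" > "$right"
  LC_ALL=C join "$left" "$right" |
    awk '$2 != $3 { print $1 ": " $2 " -> " $3 }'
  rm -f "$left" "$right"
}

# Example (file names taken from the commands above):
#   quota_diff k8s-infra-e2e-gpu-project.compute-project-info.yaml \
#              k8s-gke-gpu-boskos-02.compute-project-info.yaml | grep -i gpu
```

Piping the result through `grep -i gpu` would show right away whether any project-level quota is GPU-related.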

spiffxp · 2020-08-04T22:57:07Z

$ diff -yW80 {k8s-infra-e2e-gpu-project,k8s-gke-gpu-boskos-02}.us-west1-describe.yaml | grep -A3 "|.*limit"
- limit: 1.0			      |	- limit: 8.0
  metric: NVIDIA_P100_GPUS		  metric: NVIDIA_P100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0				- limit: 0.0
--
- limit: 1.0			      |	- limit: 0.0
  metric: PREEMPTIBLE_NVIDIA_K80_GPUS	  metric: PREEMPTIBLE_NVIDIA_K80_GPUS
  usage: 0.0				  usage: 0.0
- limit: 1.0			      |	- limit: 0.0
  metric: PREEMPTIBLE_NVIDIA_P100_GPU	  metric: PREEMPTIBLE_NVIDIA_P100_GPU
  usage: 0.0				  usage: 0.0
- limit: 1.0				- limit: 1.0
--
- limit: 50.0			      |	- limit: 500.0
  metric: IN_USE_SNAPSHOT_SCHEDULES	  metric: IN_USE_SNAPSHOT_SCHEDULES
  usage: 0.0				  usage: 0.0
- limit: 4.0				- limit: 4.0
--
- limit: 50.0			      |	- limit: 20.0
  metric: IN_USE_BACKUP_SCHEDULES	  metric: IN_USE_BACKUP_SCHEDULES
  usage: 0.0				  usage: 0.0
- limit: 10.0				- limit: 10.0
--
- limit: 0.0			      |	- limit: 16.0
  metric: COMMITTED_NVIDIA_K80_GPUS	  metric: COMMITTED_NVIDIA_K80_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 1.0
  metric: COMMITTED_NVIDIA_P100_GPUS	  metric: COMMITTED_NVIDIA_P100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 1.0
  metric: COMMITTED_NVIDIA_P4_GPUS	  metric: COMMITTED_NVIDIA_P4_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 8.0
  metric: COMMITTED_NVIDIA_V100_GPUS	  metric: COMMITTED_NVIDIA_V100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 8.0
  metric: COMMITTED_NVIDIA_T4_GPUS	  metric: COMMITTED_NVIDIA_T4_GPUS
  usage: 0.0				  usage: 0.0
- limit: 24.0				- limit: 24.0
--
- limit: 0.0			      |	- limit: 24.0
  metric: COMMITTED_N2_CPUS		  metric: COMMITTED_N2_CPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 24.0
  metric: COMMITTED_C2_CPUS		  metric: COMMITTED_C2_CPUS
  usage: 0.0				  usage: 0.0
- limit: 100.0			      |	- limit: 2000.0
  metric: RESERVATIONS			  metric: RESERVATIONS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 150.0
  metric: COMMITTED_LICENSES		  metric: COMMITTED_LICENSES
  usage: 0.0				  usage: 0.0
- limit: 24.0			      |	- limit: 0.0
  metric: N2D_CPUS			  metric: N2D_CPUS
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 24.0
  metric: COMMITTED_N2D_CPUS		  metric: COMMITTED_N2D_CPUS
  usage: 0.0				  usage: 0.0
- limit: 1024.0			      |	- limit: 128.0
  metric: STATIC_BYOIP_ADDRESSES	  metric: STATIC_BYOIP_ADDRESSES
  usage: 0.0				  usage: 0.0
- limit: 3.0			      |	- limit: 10.0
  metric: AFFINITY_GROUPS		  metric: AFFINITY_GROUPS
  usage: 0.0				  usage: 0.0
- limit: 1.0				- limit: 1.0
--
- limit: 16.0			      |	- limit: 64.0
  metric: PREEMPTIBLE_NVIDIA_A100_GPU	  metric: PREEMPTIBLE_NVIDIA_A100_GPU
  usage: 0.0				  usage: 0.0
- limit: 0.0			      |	- limit: 16.0
  metric: COMMITTED_NVIDIA_A100_GPUS	  metric: COMMITTED_NVIDIA_A100_GPUS
  usage: 0.0				  usage: 0.0
- limit: 12.0				- limit: 12.0
--
- limit: 0.0			      |	- limit: 192.0
  metric: COMMITTED_A2_CPUS		  metric: COMMITTED_A2_CPUS
  usage: 0.0				  usage: 0.0
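The `*.us-west1-describe.yaml` files compared above are not produced by any command shown in this thread. One way they could be generated is sketched below, assuming `gcloud compute regions describe` was the source; the function name `dump_region_quotas` is hypothetical, and the output file naming simply mirrors the diff command above.

```shell
# Sketch: save region-level info (including per-region quotas) per project.
# dump_region_quotas is a hypothetical helper; the file naming is an
# assumption chosen to match the diff command above.
dump_region_quotas() {
  region=$1
  shift
  for prj in "$@"; do
    gcloud compute regions describe "$region" --project="$prj" \
      > "${prj}.${region}-describe.yaml" || return 1
  done
}

# Example:
#   dump_region_quotas us-west1 k8s-gke-gpu-boskos-02 k8s-infra-e2e-gpu-project
```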

spiffxp · 2020-08-04T22:57:39Z

Jobs that use this project type:

ci-cri-containerd-e2e-gce-device-plugin-gpu
ci-kubernetes-e2e-gce-device-plugin-gpu
ci-kubernetes-e2e-gce-device-plugin-gpu-beta
ci-kubernetes-e2e-gce-device-plugin-gpu-stable1
ci-kubernetes-e2e-gce-device-plugin-gpu-stable2
ci-kubernetes-e2e-gce-device-plugin-gpu-stable3
ci-kubernetes-e2e-gce-gpu-beta-stable1-cluster-downgrade
ci-kubernetes-e2e-gce-gpu-master-stable1-cluster-downgrade
ci-kubernetes-e2e-gce-gpu-stable1-beta-cluster-upgrade
ci-kubernetes-e2e-gce-gpu-stable1-beta-master-upgrade
ci-kubernetes-e2e-gce-gpu-stable1-master-cluster-upgrade
ci-kubernetes-e2e-gce-gpu-stable1-master-master-upgrade
ci-kubernetes-e2e-gce-gpu-stable2-stable1-cluster-upgrade
ci-kubernetes-e2e-gce-gpu-stable2-stable1-master-upgrade

spiffxp · 2020-08-04T23:04:48Z

Will use canary job in kubernetes/test-infra#18664 to verify whether quota works

I'm also noticing this fun preset appears to be involved:

presets:
- labels:
    preset-ci-gce-device-plugin-gpu: "true"
  env:
  - name: KUBE_GCE_NODE_IMAGE
    value: gke-1134-gke-rc5-cos-69-10895-138-0-v190320-pre-nvda-gpu
  - name: KUBE_GCE_NODE_PROJECT
    value: gke-node-images
  - name: NODE_ACCELERATORS
    value: type=nvidia-tesla-k80,count=2

Not sure how it's used, but that may present other complications

spiffxp · 2020-08-04T23:42:18Z

Based on the preset I'm going to assume this is the quota we should pay attention to. I'm less clear about the rest

- limit: 0.0			      |	- limit: 16.0
  metric: COMMITTED_NVIDIA_K80_GPUS	  metric: COMMITTED_NVIDIA_K80_GPUS
  usage: 0.0				  usage: 0.0

spiffxp · 2020-08-04T23:52:32Z

I went ahead and submitted a request for Committed NVIDIA K80 GPUs, us-west1: 0->2
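Once a request like this is approved, the new limit can be read back from the region. This is a rough sketch, assuming the default YAML output of `gcloud compute regions describe` keeps the `- limit:` / `metric:` layout seen earlier in this thread; `region_quota_limit` is a hypothetical helper name.

```shell
# Sketch: print the current limit for one regional quota metric.
# region_quota_limit is a hypothetical helper, not part of gcloud.
# Arguments: PROJECT REGION METRIC
region_quota_limit() {
  gcloud compute regions describe "$2" --project="$1" |
    awk -v metric="$3" '
      $1 == "-" && $2 == "limit:"     { limit = $3 }
      $1 == "metric:" && $2 == metric { print limit }'
}

# Example:
#   region_quota_limit k8s-infra-e2e-gpu-project us-west1 COMMITTED_NVIDIA_K80_GPUS
```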

spiffxp · 2020-08-06T21:58:29Z

That was enough to get https://testgrid.k8s.io/sig-testing-canaries#gce-device-plugin-gpu to pass

The existing gpu-project pool is 15 projects and peaks at 5 projects

spiffxp · 2020-08-06T22:14:53Z

Ah! Fun fact: pull-kubernetes-e2e-gce-device-plugin-gpu is pinned to a single project, k8s-jkns-pr-gce-gpus, so its usage doesn't show up in the pool's numbers. So, 5 projects may not quite be enough.

spiffxp · 2020-08-07T20:01:09Z

kubernetes/test-infra#18728 - demoted pull-kubernetes-e2e-gce-device-plugin-gpu from merge-blocking, now it's manually triggered with max_concurrency 5

spiffxp · 2020-08-07T20:14:37Z

Now filling out quota requests for 10 projects...

spiffxp · 2020-08-07T20:47:52Z

... which were small enough to be automatically approved

spiffxp · 2020-08-07T21:32:54Z

Opened #1125 to add the projects to k8s-infra-prow-build's boskos as a new gpu-project pool

spiffxp · 2020-08-07T21:33:28Z

kubernetes/test-infra#18740 will add the gpu-project pool to https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?orgId=1

spiffxp · 2020-08-08T00:40:29Z

/close
Calling this done

k8s-ci-robot · 2020-08-08T00:40:42Z

@spiffxp: Closing this issue.

In response to this:

/close
Calling this done


spiffxp added the sig/release, sig/testing, wg/k8s-infra, and area/prow labels Aug 1, 2020

k8s-ci-robot assigned spiffxp Aug 4, 2020

spiffxp mentioned this issue Aug 4, 2020

Add job to test out k8s-infra-e2e-gpu-project kubernetes/test-infra#18664

Merged

spiffxp mentioned this issue Aug 4, 2020

Update gpu test node image to the latest COS gpu-prebuild image that kubernetes/test-infra#11876

Merged

spiffxp mentioned this issue Aug 6, 2020

Move gpu canary job to k8s-infra-prow-build kubernetes/test-infra#18694

Merged

spiffxp mentioned this issue Aug 7, 2020

Provision 10 gpu GCP projects #1118

Merged

This was referenced Aug 7, 2020

Add gpu-project pool to prow-build boskos #1125

Merged

Add k8s-infra gpu-project pool to boskos dashboard kubernetes/test-infra#18740

Merged

spiffxp mentioned this issue Aug 8, 2020

Migrate release-blocking gpu jobs to k8s-infra kubernetes/test-infra#18744

Merged

k8s-ci-robot closed this as completed Aug 8, 2020

spiffxp mentioned this issue Aug 8, 2020

Migrate release-master-blocking jobs to k8s-infra-prow-build #841

Closed

18 tasks

spiffxp mentioned this issue Oct 6, 2020

RFC: Move boskos testing projects pool to kubernetes.io #390

Closed
