✨ Support GPU nodes with "nvidia-gpu" flavor #1002

Merged
merged 1 commit into from
Oct 29, 2020

Conversation

@mboersma mboersma commented Oct 18, 2020

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds the nvidia-gpu flavor to support Azure N-series SKUs with NVIDIA GPUs. Creating a workload cluster from that flavor provides nvidia.com/gpu schedulable resources on agent nodes.
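As a rough illustration of what "schedulable `nvidia.com/gpu` resources" means for a workload, the sketch below builds a Pod manifest that asks for one GPU through its resource limits. The helper name `gpu_pod_manifest`, the pod name, and the container image are assumptions made for this example and are not taken from the PR.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that requests `gpus` NVIDIA GPUs via resource limits."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "OnFailure",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as nvidia.com/gpu are requested
                    # in whole units under `limits`.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical name and image, chosen only for illustration.
    manifest = gpu_pod_manifest("cuda-vector-add", "k8s.gcr.io/cuda-vector-add:v0.1")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` against a workload cluster built from this flavor.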

Which issue(s) this PR fixes:

Fixes #426

Special notes for your reviewer:

Many thanks to @sozercan for figuring out the essential commands and containerd config used here!

Note that NVv4-series GPUs are not supported. (Those VMs use an AMD GPU and are only supported on Windows.)

TODOs:

  • squashed commits
  • includes documentation
  • adds e2e tests

Release note:

✨ Support GPU nodes with "nvidia-gpu" flavor

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 18, 2020
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 18, 2020
@mboersma mboersma force-pushed the nvidia-gpu-flavor branch 2 times, most recently from bf62412 to b20cb18 Compare October 19, 2020 15:27
useExperimentalRetryJoin: true
postKubeadmCommands:
# Install the NVIDIA device plugin for Kubernetes
- KUBECONFIG=/etc/kubernetes/admin.conf kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

why run the nvidia install script on the control plane nodes if only the worker nodes are GPU enabled?

@mboersma mboersma Oct 19, 2020

Because the manifest is just a daemonset that will schedule itself on GPU agent nodes, it could be installed from anywhere (the original version of this PR had the user do it). I also know I have kubectl and a kubeconfig on the control plane nodes, but I don't think that's true on the agent nodes.

@mboersma

/retest

@devigned devigned left a comment

lgtm

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2020
@devigned

/approve cancel

@k8s-ci-robot k8s-ci-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2020
@devigned

@CecileRobertMichon looks like GH code review approve is triggering the /approve behavior. Just a heads up.

Also, I'm canceling the approval just to give others a chance to comment.

/assign @CecileRobertMichon

secret:
name: ${CLUSTER_NAME}-md-0-azure-json
key: worker-node-azure.json
- path: /etc/containerd/nvidia-config.toml

was this file taken from somewhere?

Yes, from @sozercan's gist at https://gist.github.com/sozercan/51a569cf173ef7e57a375978af8edf26 which he linked to in #426. Not sure if it has other origins.

@CecileRobertMichon

I tried deploying a GPU cluster using tilt with the nvidia-gpu flavor and calico is not coming up on the worker nodes:

k --kubeconfig ./kubeconfig get pods -A -o wide  
NAMESPACE     NAME                                                              READY   STATUS              RESTARTS   AGE     IP                NODE                                      NOMINATED NODE   READINESS GATES
kube-system   calico-kube-controllers-59d7f84b55-jwh22                          1/1     Running             0          12m     192.168.150.131   nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   calico-node-6q975                                                 1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   calico-node-cxlrz                                                 0/1     Init:0/3            0          9m7s    10.1.0.5          nvidia-gpu-template-md-0-zd5gx            <none>           <none>
kube-system   calico-node-d9frf                                                 0/1     Init:0/3            0          9m19s   10.1.0.4          nvidia-gpu-template-md-0-95bmz            <none>           <none>
kube-system   coredns-66bff467f8-56mt8                                          1/1     Running             0          12m     192.168.150.130   nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   coredns-66bff467f8-fz8d9                                          1/1     Running             0          12m     192.168.150.129   nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   etcd-nvidia-gpu-template-control-plane-lswkl                      1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-apiserver-nvidia-gpu-template-control-plane-lswkl            1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-controller-manager-nvidia-gpu-template-control-plane-lswkl   1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-proxy-5mzdj                                                  0/1     ContainerCreating   0          9m7s    10.1.0.5          nvidia-gpu-template-md-0-zd5gx            <none>           <none>
kube-system   kube-proxy-htkg2                                                  0/1     ContainerCreating   0          9m19s   10.1.0.4          nvidia-gpu-template-md-0-95bmz            <none>           <none>
kube-system   kube-proxy-l6pvz                                                  1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>
kube-system   kube-scheduler-nvidia-gpu-template-control-plane-lswkl            1/1     Running             0          12m     10.0.0.4          nvidia-gpu-template-control-plane-lswkl   <none>           <none>

I'm seeing this when describing the pod:

  Warning  FailedCreatePodSandBox  5m58s                  kubelet, nvidia-gpu-template-md-0-zd5gx  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/0d282c496afbc8ec5cf57125fcd88e99d85ef43d7668ab7a92d82378a02fe29b/log.json: no such file or directory): exec: "nvidia-container-runtime": executable file not found in $PATH: unknown

This is with VM size Standard_NV6 (because that's where I had quota) and Standard_LRS storage type.

@mboersma

I also had to use Standard_NV6 for testing because I didn't have quota for the other types. (I do now.) I've made a bunch of GPU-enabled clusters with this code but haven't seen that error (yet).

"nvidia-container-runtime": executable file not found in $PATH

Do you have access to the nodes? Could you see if nvidia-smi works there and if the nvidia-plugin daemonset is running?

@CecileRobertMichon

I see

 k --kubeconfig ./kubeconfig get daemonsets.apps -A
NAMESPACE     NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node                      3         3         1       3            1           kubernetes.io/os=linux   33m
kube-system   kube-proxy                       3         3         1       3            1           kubernetes.io/os=linux   33m
kube-system   nvidia-device-plugin-daemonset   0         0         0       0            0           <none>                   33m
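For readers who want to check this kind of output programmatically, here is a minimal sketch that parses the text of `kubectl get daemonsets -A` and flags DaemonSets that scheduled no pods or have pods that are not ready. The names `DaemonSetStatus`, `parse_daemonsets`, and `unhealthy` are hypothetical, and the parser assumes the default column order shown above.

```python
from typing import List, NamedTuple


class DaemonSetStatus(NamedTuple):
    namespace: str
    name: str
    desired: int
    ready: int


def parse_daemonsets(table: str) -> List[DaemonSetStatus]:
    """Parse the text output of `kubectl get daemonsets -A` into status rows."""
    rows = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue  # ignore blank or truncated lines
        # Assumed column order: NAMESPACE NAME DESIRED CURRENT READY ...
        rows.append(
            DaemonSetStatus(fields[0], fields[1], int(fields[2]), int(fields[4]))
        )
    return rows


def unhealthy(rows: List[DaemonSetStatus]) -> List[DaemonSetStatus]:
    """Return DaemonSets with no desired pods or with fewer ready pods than desired."""
    return [r for r in rows if r.desired == 0 or r.ready < r.desired]


# Sample copied from the output pasted in this thread.
SAMPLE = """
NAMESPACE     NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node                      3         3         1       3            1           kubernetes.io/os=linux   33m
kube-system   kube-proxy                       3         3         1       3            1           kubernetes.io/os=linux   33m
kube-system   nvidia-device-plugin-daemonset   0         0         0       0            0           <none>                   33m
"""

if __name__ == "__main__":
    for ds in unhealthy(parse_daemonsets(SAMPLE)):
        print(f"{ds.namespace}/{ds.name}: desired={ds.desired} ready={ds.ready}")
```

With the sample above, all three DaemonSets are flagged: two have pods that are not ready, and the device plugin has not scheduled any pods at all.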

@CecileRobertMichon

Ha

[  182.744200] cloud-init[1793]: [2020-10-20 19:55:04] Reading package lists...
[  182.744558] cloud-init[1793]: [2020-10-20 19:55:04] Building dependency tree...
[  182.744926] cloud-init[1793]: [2020-10-20 19:55:04] Reading state information...
[  182.745302] cloud-init[1793]: [2020-10-20 19:55:04] E: Unable to locate package nvidia-container-runtime

from cloud init logs on one of the nodes
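The apt error above is consistent with the earlier `nvidia-container-runtime` "executable file not found" failure: if the package never installed, containerd would have no such binary to run. A small sketch for pulling such failures out of a cloud-init log follows; the function name `missing_packages` is hypothetical, and the pattern only covers apt's "Unable to locate package" message.

```python
import re
from typing import List

# Matches apt's "E: Unable to locate package <name>" error anywhere in a line.
_MISSING_PKG = re.compile(r"E: Unable to locate package (\S+)")


def missing_packages(log_text: str) -> List[str]:
    """Return package names apt could not find, in first-seen order, without duplicates."""
    seen: List[str] = []
    for match in _MISSING_PKG.finditer(log_text):
        pkg = match.group(1)
        if pkg not in seen:
            seen.append(pkg)
    return seen


# Sample copied from the cloud-init output pasted in this thread.
SAMPLE_LOG = """\
[  182.744200] cloud-init[1793]: [2020-10-20 19:55:04] Reading package lists...
[  182.744558] cloud-init[1793]: [2020-10-20 19:55:04] Building dependency tree...
[  182.744926] cloud-init[1793]: [2020-10-20 19:55:04] Reading state information...
[  182.745302] cloud-init[1793]: [2020-10-20 19:55:04] E: Unable to locate package nvidia-container-runtime
"""

if __name__ == "__main__":
    print(missing_packages(SAMPLE_LOG))
```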

@mboersma

mboersma commented Oct 23, 2020

I added an e2e test spec for the nvidia-gpu flavor following the pattern of machinepool and friends. Some things still to be considered here:

  • Use Standard_LRS storage in conjunction with Standard_NV6 for the least expensive test SKU with a GPU
  • Skip this spec entirely by default and set it up as a periodic job
  • Investigate whether this test subscription has access to N-series SKUs in multiple regions and restrict accordingly
  • Should we add a GPU-enabled node pool to an existing spec instead of building a separate cluster?

@CecileRobertMichon

Skip this spec entirely by default and set it up as a periodic job

I would also set up a presubmit job, "e2e-full" or something like that, to run the whole spec optionally on PRs.

@mboersma mboersma force-pushed the nvidia-gpu-flavor branch 2 times, most recently from f7f366f to 2a2c913 Compare October 27, 2020 20:36
@mboersma mboersma changed the title [WIP] ✨ Support GPU nodes with "nvidia-gpu" flavor ✨ Support GPU nodes with "nvidia-gpu" flavor Oct 27, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 27, 2020
@mboersma

/retest

@devigned devigned left a comment

Hate to do this, but 1 super small item to address. Outside of that, lgtm.

templates/test/prow-nvidia-gpu/cni-resource-set.yaml (outdated, resolved)

@mboersma

/test ?

@k8s-ci-robot

@mboersma: The following commands are available to trigger jobs:

  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-full
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-verify
  • /test pull-cluster-api-provider-azure-conformance-v1alpha3
  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-coverage

Use /test all to run the following jobs:

  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-full
  • pull-cluster-api-provider-azure-verify
  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-coverage

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mboersma

/test pull-cluster-api-provider-azure-e2e-full

@mboersma

Looks like the GPU-enabled cluster provisioned and passed:

...
STEP: Waiting for the workload nodes to exist
INFO: Waiting for the machine pools to be provisioned
STEP: creating a Kubernetes client to the workload cluster
STEP: running a CUDA vector calculation job
STEP: waiting for job default/cuda-vector-add to be complete
STEP: creating Azure clients with the workload cluster's subscription
STEP: verifying EnableAcceleratedNetworking for the primary NIC of each VM
STEP: Dumping logs from the "capz-e2e-796qgc" workload cluster
...

@devigned devigned left a comment

/lgtm

/assign @CecileRobertMichon

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2020

@CecileRobertMichon CecileRobertMichon left a comment

/approve

I commented on kubernetes/test-infra#19715 (comment) after it merged: the e2e-full job should not run by default on PRs. Right now it's being auto-triggered because of the runIfChanged value. This requires a follow-up to test-infra.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 29, 2020
@mboersma

mboersma commented Oct 29, 2020

requires a follow up to test-infra

Yes, I saw--thanks for catching that. I'll make a PR to fix it.

Update: see kubernetes/test-infra#19756

@k8s-ci-robot k8s-ci-robot merged commit 272261b into kubernetes-sigs:master Oct 29, 2020
@k8s-ci-robot k8s-ci-robot added this to the v0.4.10 milestone Oct 29, 2020
@mboersma mboersma deleted the nvidia-gpu-flavor branch October 29, 2020 18:03