
Containerd + MicroK8s #7

Closed
3 of 8 tasks
tvansteenburgh opened this issue Oct 7, 2019 · 11 comments

@tvansteenburgh
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

The MicroK8s node never advertises the nvidia.com/gpu resource. Details below. Note that MicroK8s uses containerd. Does the gpu-operator only work with Docker right now?

2. Steps to reproduce the issue

I began with a new p2.xlarge ec2 instance running Ubuntu 18.04.3.

$ sudo snap install microk8s --channel 1.15/stable --classic                      
microk8s (1.15/stable) v1.15.4 from Canonical✓ installed                                                   
$ lspci | grep -i nvidia                                                            
00:1e.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)                                    
$ sudo lshw -c video                                                              
  *-display:0 UNCLAIMED
       description: VGA compatible controller
       product: GD 5446                                                                                                 
       vendor: Cirrus Logic                                                                                              
       physical id: 2                                                                                                    
       bus info: pci@0000:00:02.0                                                                                         
       version: 00                                                                                                      
       width: 32 bits      
       clock: 33MHz                                                                                                                            
       capabilities: vga_controller bus_master                                                                                                 
       configuration: latency=0                                                                                                                
       resources: memory:80000000-81ffffff memory:86004000-86004fff memory:c0000-dffff
  *-display:1 UNCLAIMED                                                                        
       description: 3D controller                                                              
       product: GK210GL [Tesla K80]                                                             
       vendor: NVIDIA Corporation                                                                
       physical id: 1e                                  
       bus info: pci@0000:00:1e.0                                                                    
       version: a1                                                                                   
       width: 64 bits                                                                                 
       clock: 33MHz                                                                                    
       capabilities: pm msi pciexpress bus_master cap_list                                        
       configuration: latency=0                                    
       resources: iomemory:100-ff memory:84000000-84ffffff memory:1000000000-13ffffffff memory:82000000-83ffffff        
$  lsb_release -a                                                    
No LSB modules are available.                                                                        
Distributor ID: Ubuntu                             
Description:    Ubuntu 18.04.3 LTS                                 
Release:        18.04         
Codename:       bionic                                          
$ sudo modprobe -a i2c_core ipmi_msghandler
$ sudo usermod -a -G microk8s ubuntu                                                  
$ newgrp microk8s  
$ microk8s.enable dns helm
Enabling DNS                                                                                       
Applying manifest                                                                                     
serviceaccount/coredns created                                                                        
configmap/coredns created                                                                               
deployment.apps/coredns created                                                                         
service/kube-dns created                                                                                
clusterrole.rbac.authorization.k8s.io/coredns created                                                   
clusterrolebinding.rbac.authorization.k8s.io/coredns created                                            
Restarting kubelet                                                   
DNS is enabled                
Enabling Helm                 
Fetching helm version v2.14.3.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 25.3M  100 25.3M    0     0  89.7M      0 --:--:-- --:--:-- --:--:-- 90.0M
Helm is enabled                                                                                                                                              
$ microk8s.helm init
Creating /home/ubuntu/.helm
Creating /home/ubuntu/.helm/repository
Creating /home/ubuntu/.helm/repository/cache
Creating /home/ubuntu/.helm/repository/local                  
Creating /home/ubuntu/.helm/plugins
Creating /home/ubuntu/.helm/starters                                                            
Creating /home/ubuntu/.helm/cache/archive                                                       
Creating /home/ubuntu/.helm/repository/repositories.yaml                                        
Adding stable repo with URL: https://kubernetes-charts.storage.googleapis.com
Adding local repo with URL: http://127.0.0.1:8879/charts
$HELM_HOME has been configured at /home/ubuntu/.helm.       
                                                            
Tiller (the Helm server-side component) has been installed into your Kubernetes Cluster.
                
Please note: by default, Tiller is deployed with an insecure 'allow unauthenticated users' policy.
To prevent this, run `helm init` with the --tiller-tls-verify flag.
For more information on securing your installation see: https://docs.helm.sh/using_helm/#securing-your-helm-installation
$ microk8s.helm repo add nvidia https://nvidia.github.io/gpu-operator
"nvidia" has been added to your repositories            
$ microk8s.helm repo update
Hang tight while we grab the latest from your chart repositories...      
...Skip local chart repository                                           
...Successfully got an update from the "nvidia" chart repository         
...Successfully got an update from the "stable" chart repository         
Update Complete.                                                  
$ microk8s.kubectl label nodes $(hostname) node-role.kubernetes.io/master=
node/ip-172-31-25-86 labeled                                    
$ microk8s.helm install --devel nvidia/gpu-operator -n test-operator --wait
NAME:   test-operator                                                                                        
LAST DEPLOYED: Mon Oct  7 17:04:17 2019
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/ClusterRole
NAME                       AGE
nfd-master                 19s
special-resource-operator  19s

==> v1/ClusterRoleBinding
NAME                       AGE
nfd-master                 19s
special-resource-operator  19s

==> v1/ConfigMap
NAME             DATA  AGE
operator-config  2     19s

==> v1/DaemonSet
NAME        DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                    AGE
nfd-master  1        1        1      1           1          node-role.kubernetes.io/master=  19s
nfd-worker  1        1        1      1           1          <none>                           19s

==> v1/Deployment
NAME                       READY  UP-TO-DATE  AVAILABLE  AGE
special-resource-operator  1/1    1           1          19s

==> v1/Namespace
NAME                    STATUS  AGE
gpu-operator            Active  19s
gpu-operator-resources  Active  19s
node-feature-discovery  Active  19s

==> v1/Pod(related)
NAME                                        READY  STATUS   RESTARTS  AGE
nfd-master-88ctj                            1/1    Running  0         19s
nfd-worker-z9gq6                            1/1    Running  1         19s
special-resource-operator-78c7499d65-2l79k  1/1    Running  0         19s

==> v1/Service
NAME        TYPE       CLUSTER-IP     EXTERNAL-IP  PORT(S)   AGE
nfd-master  ClusterIP  10.152.183.37  <none>       8080/TCP  19s

==> v1/ServiceAccount
NAME                       SECRETS  AGE
nfd-master                 1        19s
special-resource-operator  1        19s

==> v1beta1/CustomResourceDefinition
NAME                               AGE
specialresources.sro.openshift.io  19s

$ microk8s.kubectl get po -A -o wide
NAMESPACE                NAME                                         READY   STATUS    RESTARTS   AGE     IP             NODE              NOMINATED NODE   READINESS GATES
gpu-operator             special-resource-operator-78c7499d65-2l79k   1/1     Running   0          42s     10.1.1.6       ip-172-31-25-86   <none>           <none>
kube-system              coredns-f7867546d-vnt8h                      1/1     Running   0          2m15s   10.1.1.3       ip-172-31-25-86   <none>           <none>
kube-system              tiller-deploy-75f6c87b87-j5pp7               1/1     Running   0          115s    10.1.1.4       ip-172-31-25-86   <none>           <none>
node-feature-discovery   nfd-master-88ctj                             1/1     Running   0          42s     10.1.1.5       ip-172-31-25-86   <none>           <none>
node-feature-discovery   nfd-worker-z9gq6                             1/1     Running   1          42s     172.31.25.86   ip-172-31-25-86   <none>           <none>
$ microk8s.kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/manifests/cr/sro_cr_sched_none.yaml
specialresource.sro.openshift.io/gpu created
$ microk8s.kubectl get all -A
NAMESPACE                NAME                                             READY   STATUS    RESTARTS   AGE
gpu-operator             pod/special-resource-operator-78c7499d65-2l79k   1/1     Running   0          96s
kube-system              pod/coredns-f7867546d-vnt8h                      1/1     Running   0          3m9s
kube-system              pod/tiller-deploy-75f6c87b87-j5pp7               1/1     Running   0          2m49s
node-feature-discovery   pod/nfd-master-88ctj                             1/1     Running   0          96s
node-feature-discovery   pod/nfd-worker-z9gq6                             1/1     Running   1          96s


NAMESPACE                NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
default                  service/kubernetes      ClusterIP   10.152.183.1     <none>        443/TCP                  5m8s
kube-system              service/kube-dns        ClusterIP   10.152.183.10    <none>        53/UDP,53/TCP,9153/TCP   3m9s
kube-system              service/tiller-deploy   ClusterIP   10.152.183.181   <none>        44134/TCP                2m49s
node-feature-discovery   service/nfd-master      ClusterIP   10.152.183.37    <none>        8080/TCP                 96s

NAMESPACE                NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
node-feature-discovery   daemonset.apps/nfd-master   1         1         1       1            1           node-role.kubernetes.io/master=   96s
node-feature-discovery   daemonset.apps/nfd-worker   1         1         1       1            1           <none>                            96s

NAMESPACE      NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
gpu-operator   deployment.apps/special-resource-operator   1/1     1            1           96s
kube-system    deployment.apps/coredns                     1/1     1            1           3m9s
kube-system    deployment.apps/tiller-deploy               1/1     1            1           2m49s

NAMESPACE      NAME                                                   DESIRED   CURRENT   READY   AGE
gpu-operator   replicaset.apps/special-resource-operator-78c7499d65   1         1         1       96s
kube-system    replicaset.apps/coredns-f7867546d                      1         1         1       3m9s
kube-system    replicaset.apps/tiller-deploy-75f6c87b87               1         1         1       2m49s

$ microk8s.kubectl apply -f https://nvidia.github.io/gpu-operator/notebook-example.yml
service/tf-notebook created
pod/tf-notebook created
$ microk8s.kubectl get po -A --watch
NAMESPACE                NAME                                         READY   STATUS    RESTARTS   AGE
default                  tf-notebook                                  0/1     Pending   0          16s
gpu-operator             special-resource-operator-78c7499d65-2l79k   1/1     Running   0          2m22s
kube-system              coredns-f7867546d-vnt8h                      1/1     Running   0          3m55s
kube-system              tiller-deploy-75f6c87b87-j5pp7               1/1     Running   0          3m35s
node-feature-discovery   nfd-master-88ctj                             1/1     Running   0          2m22s
node-feature-discovery   nfd-worker-z9gq6                             1/1     Running   1          2m22s
$ microk8s.kubectl describe pod tf-notebook
Name:         tf-notebook
Namespace:    default
Priority:     0
Node:         <none>
Labels:       app=tf-notebook
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"tf-notebook"},"name":"tf-notebook","namespace":"default"},"s...
Status:       Pending
IP:
Containers:
  tf-notebook:
    Image:      gcr.io/kubeflow/tensorflow-notebook-gpu:latest
    Port:       8888/TCP
    Host Port:  0/TCP
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-txjj5 (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-txjj5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-txjj5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  56s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

The tf-notebook pod never leaves the Pending state.
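A quick way to confirm the missing resource is to inspect the node's allocatable capacity, where nvidia.com/gpu never appears (a hypothetical check, not captured in the original run):

$ microk8s.kubectl get node $(hostname) -o jsonpath='{.status.allocatable}'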

3. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods --all-namespaces
$ microk8s.kubectl get pods --all-namespaces
NAMESPACE                NAME                                         READY   STATUS    RESTARTS   AGE
default                  tf-notebook                                  0/1     Pending   0          12m
gpu-operator             special-resource-operator-78c7499d65-2l79k   1/1     Running   0          14m
kube-system              coredns-f7867546d-vnt8h                      1/1     Running   0          16m
kube-system              tiller-deploy-75f6c87b87-j5pp7               1/1     Running   0          16m
node-feature-discovery   nfd-master-88ctj                             1/1     Running   0          14m
node-feature-discovery   nfd-worker-z9gq6                             1/1     Running   1          14m
  • Output of running a container on the GPU machine: docker run -it alpine echo foo

  • Docker configuration file: cat /etc/docker/daemon.json

  • Docker runtime configuration: docker info | grep runtime
    There is no Docker - MicroK8s uses containerd.

  • NVIDIA shared directory: ls -la /run/nvidia

$ ls -la /run/nvidia
ls: cannot access '/run/nvidia': No such file or directory
  • NVIDIA packages directory: ls -la /run/nvidia/toolkit
  • NVIDIA driver directory: ls -la /run/nvidia/driver
  • kubelet logs: journalctl -u kubelet > kubelet.logs
$ journalctl -u snap.microk8s.daemon-containerd.service -o cat | pastebinit
http://paste.ubuntu.com/p/73xkpkNX7D/
@RenaudWasTaken
Contributor

Hello!

We don't support containerd just yet :)
It will be here soonish though (1-2 months)!

@RenaudWasTaken changed the title from "Running on MicroK8s" to "Containerd + MicroK8s" on Oct 7, 2019
@RenaudWasTaken
Contributor

Thanks for looking into this by the way!

@tvansteenburgh
Author

@RenaudWasTaken Hey, my pleasure, and thanks for the quick response. Eager to try this again when containerd support is added.

@nvjmayo
Contributor

nvjmayo commented Jan 21, 2020

Marking this as a feature so we don't lose track of it.

@nvjmayo
Contributor

nvjmayo commented Jul 27, 2020

Just to update: we don't currently support MicroK8s. Once we do, we'll post updates here. Thanks for your interest in the GPU Operator.

@tvansteenburgh
Author

Once the GPU Operator works with containerd, we can make it work on MicroK8s. The only thing preventing MicroK8s from using the GPU Operator today is the lack of containerd support. The problem is not unique to MicroK8s; the GPU Operator won't work on any k8s cluster that uses containerd instead of Docker.

@jtm5044
jtm5044 commented Jul 30, 2020

+1 for containerd/MicroK8s support. I know in October it was estimated at 1-2 months away, but I also know priorities change. Is this still on the roadmap? Is it something I can expect? Thanks!

@nvjmayo
Contributor

nvjmayo commented Aug 3, 2020

Yes, it's still on the roadmap, and we would still like to support containerd, but it's not a priority right now and no one is working on it at this time.
We'll post a better ETA soonish. ;-)

@CecileRobertMichon
+1 to containerd support

This would allow us to consume the operator with Cluster API clusters (ref: kubernetes-sigs/cluster-api-provider-azure#426)

@jessehu
jessehu commented Nov 12, 2020

Thanks @nvjmayo @RenaudWasTaken. Is there a ticket tracking containerd support, and when will it be released?

@klueska
Contributor

klueska commented Dec 2, 2020

containerd support is coming in the 1.4.0 release of the GPU operator (scheduled for release this week).

Here is a list of the PRs that enabled it:
https://gitlab.com/nvidia/container-toolkit/container-config/-/merge_requests/40
https://gitlab.com/nvidia/container-toolkit/container-config/-/merge_requests/43
https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/145

The following flags can be used to control containerd support in values.yaml:

operator:
  defaultRuntime: containerd

toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /etc/containerd/config.toml
  - name: CONTAINERD_SOCKET
    value: /run/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia-container-runtime
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"  # env values must be strings, so the boolean is quoted

First you set the runtime to containerd under the operator component. Then you customize the specific containerd settings via environment variables on the toolkit component.

The settings shown above for the toolkit component are the defaults that apply if you don't set anything at all.
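For example, the overrides can be saved to a file and passed at install time (a sketch, assuming Helm 3 syntax and the nvidia chart repo added earlier in this thread):

$ helm install gpu-operator nvidia/gpu-operator -f values.yaml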

In the context of microk8s, you would want to set CONTAINERD_CONFIG and CONTAINERD_SOCKET to point to the containerd config and socket files inside the snap sandbox.
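On a stock MicroK8s snap install those are typically the following paths (an assumption based on the snap layout, not from the original comment, so verify them on your node):

toolkit:
  env:
  # config template and socket live inside the MicroK8s snap sandbox
  - name: CONTAINERD_CONFIG
    value: /var/snap/microk8s/current/args/containerd-template.toml
  - name: CONTAINERD_SOCKET
    value: /var/snap/microk8s/common/run/containerd.sock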

The CONTAINERD_RUNTIME_CLASS variable can then be optionally set to customize the name of the runtime-class associated with the nvidia-container-runtime in the containerd config file.

Likewise, CONTAINERD_SET_AS_DEFAULT can optionally be set to toggle whether you want the nvidia-container-runtime to be the default runtime used by containerd (i.e. when launching containers in a pod without an explicit RuntimeClass), or whether you want it to be used only when a pod sets its RuntimeClass to match the name set in CONTAINERD_RUNTIME_CLASS.
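For instance, with CONTAINERD_SET_AS_DEFAULT disabled, a pod opts in explicitly through a RuntimeClass. A minimal sketch, assuming you create the RuntimeClass object yourself and that its handler must match the runtime name written into the containerd config:

apiVersion: node.k8s.io/v1beta1  # node.k8s.io/v1 on Kubernetes 1.20+
kind: RuntimeClass
metadata:
  name: nvidia-container-runtime
handler: nvidia-container-runtime  # matches CONTAINERD_RUNTIME_CLASS
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  runtimeClassName: nvidia-container-runtime
  containers:
  - name: cuda
    image: nvidia/cuda:11.0-base
    resources:
      limits:
        nvidia.com/gpu: 1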
