aws-eks: neuron device plugin manifest better reference #29262

Open
freschri opened this issue Feb 26, 2024 · 3 comments
Labels
@aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service bug This issue is a bug. needs-review p2

Comments

@freschri
Contributor

Describe the bug

The Neuron device plugin add-on in the CDK uses a custom manifest, loaded here:

const fileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin.yaml'), 'utf8');

This manifest does NOT point to the official Neuron image (public.ecr.aws/neuron/neuron-device-plugin), and the RBAC resources are missing.
As a result, the plugin pods go into CrashLoopBackOff and metrics are not exposed.
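
One way to confirm which image a manifest references is to scan its text for `image:` keys. The following is a minimal sketch, not part of aws-cdk-lib: the helper names are hypothetical, and it reads raw text line by line rather than parsing YAML, so it assumes each image reference sits on a single `image:` line.

```typescript
// Hypothetical helpers for illustration; not part of aws-cdk-lib.
// The prefix is the official image location named in this issue.
const OFFICIAL_NEURON_IMAGE_PREFIX = "public.ecr.aws/neuron/neuron-device-plugin";

/** Collects every value of an `image:` key found in raw manifest text. */
function findContainerImages(manifestText: string): string[] {
  const images: string[] = [];
  manifestText.split(/\r?\n/).forEach((line) => {
    // Matches "image: foo" and "- image: foo", with optional quotes.
    const match = /^\s*(?:-\s+)?image:\s*["']?([^\s"']+)["']?\s*$/.exec(line);
    const image = match ? match[1] : undefined;
    if (image) {
      images.push(image);
    }
  });
  return images;
}

/** Returns the images that do not come from the official Neuron repository. */
function findUnofficialImages(manifestText: string): string[] {
  return findContainerImages(manifestText).filter(
    (image) => image.indexOf(OFFICIAL_NEURON_IMAGE_PREFIX) !== 0
  );
}
```

Running `findUnofficialImages` over the text of the bundled neuron-device-plugin.yaml would list any image reference that does not start with the official prefix.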

Expected Behavior

The correct files are used: the official Neuron device plugin manifest together with its RBAC manifest.

Current Behavior

The device plugin pods go into CrashLoopBackOff when deployed on inf2.xlarge.

Reproduction Steps

Deploy on inf2 instances.

Possible Solution

The Neuron device plugin add-on in the CDK uses a custom manifest, loaded here:

const fileContents = fs.readFileSync(path.join(__dirname, 'addons', 'neuron-device-plugin.yaml'), 'utf8');

A better reference already exists in the Neuron documentation; see the description here: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-setup.html

The YAML to use is https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml
RBAC also needs to be applied, which the current implementation does not do:
const RBAC_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml";
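
A rough sketch of how the two upstream files could be sanity-checked together before being applied to the cluster. It only splits multi-document YAML text on `---` separator lines and reads each document's top-level `kind:`; it does not parse YAML. All function names are hypothetical, and which kinds the upstream files actually contain is an assumption to verify against the files themselves.

```typescript
// Hypothetical helpers for illustration; not part of aws-cdk-lib.

/** Splits multi-document YAML text on lines that consist only of `---`. */
function splitYamlDocuments(text: string): string[] {
  return text
    .split(/^---[ \t]*$/m)
    .map((doc) => doc.trim())
    .filter((doc) => doc.length > 0);
}

/** Reads the top-level (unindented) `kind:` of one YAML document, if present. */
function readKind(doc: string): string | undefined {
  const match = /^kind:\s*(\S+)\s*$/m.exec(doc);
  return match ? match[1] : undefined;
}

/** Lists the kinds found across all given manifest texts, in order. */
function listKinds(...manifestTexts: string[]): string[] {
  const kinds: string[] = [];
  manifestTexts.forEach((text) => {
    splitYamlDocuments(text).forEach((doc) => {
      const kind = readKind(doc);
      if (kind !== undefined) {
        kinds.push(kind);
      }
    });
  });
  return kinds;
}

/** Returns the entries of `required` that no manifest document provides. */
function missingKinds(required: string[], ...manifestTexts: string[]): string[] {
  const present = listKinds(...manifestTexts);
  return required.filter((kind) => present.indexOf(kind) === -1);
}
```

For example, `missingKinds(["DaemonSet", "ServiceAccount"], pluginYamlText)` would report `ServiceAccount` when only the plugin manifest is supplied without its RBAC companion. In an actual CDK fix, each parsed document would presumably be passed to `cluster.addManifest`; that wiring is left out here.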

Additional Information/Context

No response

CDK CLI Version

2.130.0

Framework Version

No response

Node.js Version

v20.4.0

OS

macOS Sonoma 14.3

Language

TypeScript

Language Version

No response

Other information

No response

@freschri freschri added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Feb 26, 2024
@github-actions github-actions bot added the @aws-cdk/aws-eks Related to Amazon Elastic Kubernetes Service label Feb 26, 2024
@pahud
Contributor

pahud commented Feb 27, 2024

Thank you for the report. We probably need to update this file:
https://github.com/aws/aws-cdk/blob/f3d74bb78189ec6b76cfa85c97d993c1b26c1cac/packages/aws-cdk-lib/aws-eks/lib/addons/neuron-device-plugin.yaml

Would you be interested in submitting a PR for that?

@pahud pahud added p1 needs-review and removed needs-triage This issue or PR still needs to be triaged. labels Feb 27, 2024
@pahud pahud added p2 and removed p1 labels Mar 5, 2024
@wafuwafu13
Contributor

wafuwafu13 commented Mar 29, 2024

It is reproducible.
I'm working on it.

import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';

export class CdkIssueStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'VPC', {
      maxAzs: 3
    });

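    const cluster = new eks.Cluster(this, 'EKSCluster', {
      vpc,
      version: eks.KubernetesVersion.V1_29,
      // No default capacity; the nodegroup below provides the nodes.
      defaultCapacity: 0,
      // "xxx" is a placeholder for the role ARN.
      mastersRole: iam.Role.fromRoleArn(this, 'Admin', "xxx", {
        mutable: false,
      })
    });

    // inf2.xlarge nodegroup used to reproduce the crash.
    cluster.addNodegroupCapacity('Inf2NodeGroup', {
      instanceTypes: [new ec2.InstanceType('inf2.xlarge')],
      minSize: 2,
    });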
  }
}
$ kubectl describe daemonset neuron-device-plugin-daemonset -n kube-system
Name:           neuron-device-plugin-daemonset
Selector:       name=neuron-device-plugin-ds
Node-Selector:  <none>
Labels:         aws.cdk.eks/prune-xxx
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 2
Current Number of Nodes Scheduled: 2
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  2 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:       name=neuron-device-plugin-ds
  Annotations:  scheduler.alpha.kubernetes.io/critical-pod: 
  Containers:
   k8s-neuron-device-plugin-ctr:
    Image:        790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
    Port:         <none>
    Host Port:    <none>
    Environment:  <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:               HostPath (bare host directory volume)
    Path:               /var/lib/kubelet/device-plugins
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:
  Type    Reason            Age   From                  Message
  ----    ------            ----  ----                  -------
  Normal  SuccessfulCreate  37m   daemonset-controller  Created pod: neuron-device-plugin-daemonset-f578d
  Normal  SuccessfulCreate  37m   daemonset-controller  Created pod: neuron-device-plugin-daemonset-d4ksr
$ kubectl get pods -n kube-system
NAME                                   READY   STATUS             RESTARTS         AGE
aws-node-ghjqh                         2/2     Running            0                41m
aws-node-vjq99                         2/2     Running            0                42m
coredns-68bd859788-flbr4               1/1     Running            0                45m
coredns-68bd859788-wxtfv               1/1     Running            0                45m
kube-proxy-54klc                       1/1     Running            0                41m
kube-proxy-kx9rm                       1/1     Running            0                42m
neuron-device-plugin-daemonset-d4ksr   0/1     CrashLoopBackOff   12 (2m37s ago)   39m
neuron-device-plugin-daemonset-f578d   0/1     CrashLoopBackOff   12 (2m22s ago)   39m
$ kubectl describe pod neuron-device-plugin-daemonset-d4ksr -n kube-system
Name:                 neuron-device-plugin-daemonset-d4ksr
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      default
Node:                 ip-10-0-240-116.eu-west-1.compute.internal/10.0.240.116
Start Time:           Fri, 29 Mar 2024 08:55:24 +0000
Labels:               controller-revision-hash=67496f5558
                      name=neuron-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.0.201.70
IPs:
  IP:           10.0.201.70
Controlled By:  DaemonSet/neuron-device-plugin-daemonset
Containers:
  k8s-neuron-device-plugin-ctr:
    Container ID:   containerd://6e5f8d1ebdc2591edd37ccfe20c79169dc1564d2e163e0d704cbef02d957dda9
    Image:          790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0
    Image ID:       790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin@sha256:6a0df1d6446c96b752f7abbdc9478873e2f3da05989dcaf17667076db8339728
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Fri, 29 Mar 2024 09:31:51 +0000
      Finished:     Fri, 29 Mar 2024 09:31:51 +0000
    Ready:          False
    Restart Count:  12
    Environment:    <none>
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-65qsg (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  
  kube-api-access-65qsg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 CriticalAddonsOnly op=Exists
                             aws.amazon.com/neuron:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  39m                    default-scheduler  Successfully assigned kube-system/neuron-device-plugin-daemonset-d4ksr to ip-10-0-240-116.eu-west-1.compute.internal
  Normal   Pulled     39m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 8.068s (8.068s including waiting)
  Normal   Pulled     39m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 683ms (683ms including waiting)
  Normal   Pulled     39m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 672ms (672ms including waiting)
  Normal   Started    38m (x4 over 39m)      kubelet            Started container k8s-neuron-device-plugin-ctr
  Normal   Pulled     38m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 680ms (680ms including waiting)
  Normal   Pulling    37m (x5 over 39m)      kubelet            Pulling image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0"
  Normal   Created    37m (x5 over 39m)      kubelet            Created container k8s-neuron-device-plugin-ctr
  Normal   Pulled     37m                    kubelet            Successfully pulled image "790709498068.dkr.ecr.us-west-2.amazonaws.com/neuron-device-plugin:1.0.9043.0" in 678ms (678ms including waiting)
  Warning  BackOff    4m19s (x163 over 39m)  kubelet            Back-off restarting failed container k8s-neuron-device-plugin-ctr in pod neuron-device-plugin-daemonset-d4ksr_kube-system(5b998b0a-c411-4aa0-916a-4b08433213f6)
$ kubectl logs neuron-device-plugin-daemonset-d4ksr -n kube-system
neuron-device-plugin: 2024/03/29 09:31:51 Fetching devices.
neuron-device-plugin: 2024/03/29 09:31:51 Error to get IB device: open /run/infa-map.json: no such file or directory
neuron-device-plugin: 2024/03/29 09:31:51 No devices found.
neuron-device-plugin: 2024/03/29 09:31:51 Device list: []
neuron-device-plugin: 2024/03/29 09:31:51 Starting FS watcher.
neuron-device-plugin: 2024/03/29 09:31:51 Starting OS watcher.
neuron-device-plugin: 2024/03/29 09:31:51 Error to get devices: open /run/infa-map.json: no such file or directory
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x85bb96]

goroutine 1 [running]:
main.(*DevicePlugin).cleanup(0x0, 0x1, 0x1)
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:203 +0x26
main.(*DevicePlugin).Start(0x0, 0xc000120048, 0x10)
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:75 +0x2f
main.(*DevicePlugin).Serve(0x0, 0x9700e4, 0x15, 0xc0000665a0, 0x0)
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/server.go:229 +0x35
main.main()
	/opt/workspace/KaenaTools/build/private/build/SRC_CPY/cmd/k8s-neuron-device-plugin/main.go:64 +0x3a8
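
To pick out the failing pods from `kubectl get pods` output like the listing above without reading it by eye, here is a minimal sketch. It assumes the default column layout (NAME, READY, STATUS first) and simply splits each row on whitespace; the type and function names are hypothetical.

```typescript
// Hypothetical helpers for illustration; they only parse text that was
// already captured from `kubectl get pods`.

interface PodRow {
  podName: string;
  ready: string;
  podStatus: string;
}

/** Parses the first three columns (NAME, READY, STATUS) of each row. */
function parseGetPods(output: string): PodRow[] {
  const rows: PodRow[] = [];
  output.split(/\r?\n/).forEach((line) => {
    const fields = line.trim().split(/\s+/);
    const podName = fields[0];
    const ready = fields[1];
    const podStatus = fields[2];
    // Skip the header row and blank or malformed lines.
    if (!podName || !ready || !podStatus || podName === "NAME") {
      return;
    }
    rows.push({ podName, ready, podStatus });
  });
  return rows;
}

/** Returns the names of pods whose STATUS column is CrashLoopBackOff. */
function findCrashLoopingPods(output: string): string[] {
  return parseGetPods(output)
    .filter((row) => row.podStatus === "CrashLoopBackOff")
    .map((row) => row.podName);
}
```

Applied to the listing above, `findCrashLoopingPods` would return the two neuron-device-plugin-daemonset pods.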

@freschri
Contributor Author

freschri commented Apr 4, 2024

@wafuwafu13 @pahud please note my suggestion under "Possible Solution":
The YAML to use is https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin.yml
RBAC also needs to be applied, which the current implementation does not do:
const RBAC_URL = "https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/master/src/k8/k8s-neuron-device-plugin-rbac.yml";
