
dynatrace-oneagent-csi-driver provisioner gets OOMKilled #2510

Closed
reinhard-brandstaedter opened this issue Dec 19, 2023 · 2 comments
Labels: support request (request for further assistance with an issue)

Comments

@reinhard-brandstaedter

Describe the bug
I'm trying to use the cloud-native full-stack approach with the CSI driver on a bare-metal K8s cluster. Everything works fine, except that the CSI driver provisioner container keeps getting OOMKilled and consequently goes into BackOff.
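For reference, the OOM kill is visible directly in the container status of the affected pods (the pod name below is just one of them):

# kubectl -n dynatrace get pod dynatrace-oneagent-csi-driver-b7w87 -o jsonpath='{.status.containerStatuses[?(@.name=="provisioner")].lastState.terminated.reason}'
OOMKilled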

To Reproduce
Steps to reproduce the behavior:

  1. 3-node K8s cluster on CoreOS:
NAME           STATUS   ROLES           AGE    VERSION
k8s-master     Ready    control-plane   535d   v1.28.4
k8s-worker-1   Ready    <none>          535d   v1.28.4
k8s-worker-2   Ready    <none>          535d   v1.28.4
  2. Follow the installation steps here. No adjustments to the template configuration were made (except the API tokens).
  3. The Dynatrace Operator seems to work fine; the cluster is visible in Dynatrace.
  4. The CSI driver provisioner starts getting OOMKilled and eventually goes into CrashLoopBackOff:
NAME                                  READY   STATUS             RESTARTS        AGE
dynakube-activegate-0                 1/1     Running            1 (8m45s ago)   6h14m
dynakube-oneagent-b9pxz               1/1     Running            1 (11m ago)     6h14m
dynakube-oneagent-kfpsz               1/1     Running            1 (8m45s ago)   6h14m
dynakube-oneagent-rqd5s               1/1     Running            1 (6m50s ago)   6h14m
dynatrace-oneagent-csi-driver-b7w87   3/4     CrashLoopBackOff   82 (71s ago)    6h17m
dynatrace-oneagent-csi-driver-k8t9q   3/4     CrashLoopBackOff   82 (116s ago)   6h17m
dynatrace-oneagent-csi-driver-klsgw   3/4     CrashLoopBackOff   81 (47s ago)    6h17m
dynatrace-operator-8745d7bd6-pc9cr    1/1     Running            1 (8m55s ago)   6h17m
dynatrace-webhook-56b449b747-82rrf    1/1     Running            1 (11m ago)     6h17m
dynatrace-webhook-56b449b747-r68p2    1/1     Running            1 (8m55s ago)   6h17m

# kubectl -n dynatrace get dynakubes
NAME       APIURL                                    STATUS    AGE
dynakube   https://mfk00070.live.dynatrace.com/api   Running   6h19m
  5. The troubleshoot command shows OK:
# kubectl exec deploy/dynatrace-operator -n dynatrace -- dynatrace-operator troubleshoot
{"level":"info","ts":"2023-12-19T20:59:13.601Z","logger":"version","msg":"dynatrace-operator","version":"snapshot","gitCommit":"","buildDate":"","goVersion":"go1.21.5","platform":"linux/amd64"}
[oneAgentAPM] 	--- checking if OneAgentAPM object exists ...
[oneAgentAPM] 	 ✓  OneAgentAPM does not exist
[namespace ] 	--- checking if namespace 'dynatrace' exists ...
[namespace ] 	 ✓  using namespace 'dynatrace'
[crd       ] 	--- checking if CRD for Dynakube exists ...
[crd       ] 	 ✓  CRD for Dynakube exists
[dynakube  ] 	--- checking if 'dynatrace:dynakube' Dynakube is configured correctly
[dynakube  ] 	    using 'dynatrace:dynakube' Dynakube
[dynakube  ] 	    secret token 'apiToken' exists
[dynakube  ] 	    checking if syntax of API URL is valid
[dynakube  ] 	    syntax of API URL is valid
[dynakube  ] 	    checking if token scopes are valid
[dynakube  ] 	    token scopes are valid
[dynakube  ] 	    checking if can pull latest agent version
[dynakube  ] 	    API token is valid, can pull latest agent version
[dynakube  ] 	    pull secret 'dynatrace:dynakube-pull-secret' exists
[dynakube  ] 	    secret token '.dockerconfigjson' exists
[dynakube  ] 	 ✓  'dynatrace:dynakube' Dynakube is valid
[dynakube.imagepull] 	--- Verifying that OneAgent image mfk00070.live.dynatrace.com/linux/oneagent:latest can be pulled ...
[dynakube.imagepull] 	 ✓  OneAgent image mfk00070.live.dynatrace.com/linux/oneagent:latest can be successfully pulled
[dynakube.imagepull] 	--- Verifying that OneAgentCodeModules (custom image) image  can be pulled ...
[dynakube.imagepull] 	    No OneAgentCodeModules (custom image) image configured
[dynakube.imagepull] 	--- Verifying that ActiveGate image mfk00070.live.dynatrace.com/linux/activegate:latest can be pulled ...
[dynakube.imagepull] 	 ✓  ActiveGate image mfk00070.live.dynatrace.com/linux/activegate:latest can be successfully pulled
[dynakube.proxy] 	--- Analyzing proxy settings ...
[dynakube.proxy] 	 ✓  No proxy settings found.

Expected behavior
The logs do not show any pointers as to why this could be failing. I also checked the hostPath volumes on the nodes, which seem fine as well. So I'd expect this to work just fine.
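In case it helps, this is roughly how I checked (the data directory path is the one from the DaemonSet's hostPath volume; kubectl top needs metrics-server). On each node, the CSI data directory exists and has content:

# du -sh /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com/data

And per-container memory usage of the CSI driver pods, which should show the provisioner approaching its 100Mi limit before it is killed:

# kubectl -n dynatrace top pod -l internal.oneagent.dynatrace.com/component=csi-driver --containers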

Additional context
Attaching a support archive which I pulled via the support-archive command
operator-support-archive-2023-12-19T20_10_37Z.zip
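(In case it's relevant for reproducing: the archive was generated with the operator's support-archive subcommand, invoked the same way as the troubleshoot command above; exact invocation may differ in other operator versions.)

# kubectl exec deploy/dynatrace-operator -n dynatrace -- dynatrace-operator support-archive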

@chrismuellner added the "support request" label on Dec 20, 2023
@reinhard-brandstaedter (Author)

Adding the detailed output from the CSI driver pod:

% kubectl -n dynatrace describe pod dynatrace-oneagent-csi-driver-5hgmz
Name:                 dynatrace-oneagent-csi-driver-5hgmz
Namespace:            dynatrace
Priority:             1000000
Priority Class Name:  dynatrace-high-priority
Node:                 k8s-worker-2/192.168.1.103
Start Time:           Wed, 20 Dec 2023 10:39:58 +0100
Labels:               app.kubernetes.io/component=csi-driver
                      app.kubernetes.io/name=dynatrace-operator
                      app.kubernetes.io/version=0.15.0
                      controller-revision-hash=5d8fd4f86f
                      internal.oneagent.dynatrace.com/app=csi-driver
                      internal.oneagent.dynatrace.com/component=csi-driver
                      pod-template-generation=1
Annotations:          cluster-autoscaler.kubernetes.io/enable-ds-eviction: false
                      cni.projectcalico.org/podIP: 172.16.140.44/32
                      cni.projectcalico.org/podIPs: 172.16.140.44/32
                      dynatrace.com/inject: false
                      kubectl.kubernetes.io/default-container: provisioner
Status:               Running
IP:                   172.16.140.44
IPs:
  IP:           172.16.140.44
Controlled By:  DaemonSet/dynatrace-oneagent-csi-driver
Init Containers:
  csi-init:
    Container ID:  docker://5f8ffc7d072984434508e243286fd2fcd5b1d364042812c98bc7e0ae8bfec9d4
    Image:         docker.io/dynatrace/dynatrace-operator:v0.15.0
    Image ID:      docker-pullable://dynatrace/dynatrace-operator@sha256:0175f28a0176a998bd559adeef1cc69c07e876f4e6fdbc6b02f1cab71eca9179
    Port:          <none>
    Host Port:     <none>
    Args:
      csi-init
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 20 Dec 2023 10:40:16 +0100
      Finished:     Wed, 20 Dec 2023 10:40:17 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  100Mi
    Requests:
      cpu:        50m
      memory:     100Mi
    Environment:  <none>
    Mounts:
      /data from data-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjtzj (ro)
Containers:
  server:
    Container ID:  docker://d81ed2bb448474fe6319b7c30e880c0c8854f78e18158f536643a9a2e8316789
    Image:         docker.io/dynatrace/dynatrace-operator:v0.15.0
    Image ID:      docker-pullable://dynatrace/dynatrace-operator@sha256:0175f28a0176a998bd559adeef1cc69c07e876f4e6fdbc6b02f1cab71eca9179
    Port:          10080/TCP
    Host Port:     0/TCP
    Args:
      csi-server
      --endpoint=unix://csi/csi.sock
      --node-id=$(KUBE_NODE_NAME)
      --health-probe-bind-address=:10080
    State:          Running
      Started:      Wed, 20 Dec 2023 10:40:18 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  100Mi
    Requests:
      cpu:     50m
      memory:  100Mi
    Liveness:  http-get http://:livez/livez delay=5s timeout=1s period=5s #success=1 #failure=3
    Startup:   exec [/usr/local/bin/dynatrace-operator startup-probe] delay=0s timeout=5s period=10s #success=1 #failure=1
    Environment:
      POD_NAMESPACE:   dynatrace (v1:metadata.namespace)
      KUBE_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /csi from plugin-dir (rw)
      /data from data-dir (rw)
      /tmp from tmp-dir (rw)
      /var/lib/kubelet/pods/ from mountpoint-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjtzj (ro)
  provisioner:
    Container ID:  docker://03339255d2640013cf9553ea1340257bb179070f8af363552df713441a896a40
    Image:         docker.io/dynatrace/dynatrace-operator:v0.15.0
    Image ID:      docker-pullable://dynatrace/dynatrace-operator@sha256:0175f28a0176a998bd559adeef1cc69c07e876f4e6fdbc6b02f1cab71eca9179
    Port:          10090/TCP
    Host Port:     0/TCP
    Args:
      csi-provisioner
      --health-probe-bind-address=:10090
    State:          Running
      Started:      Wed, 20 Dec 2023 10:48:35 +0100
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 20 Dec 2023 10:47:40 +0100
      Finished:     Wed, 20 Dec 2023 10:47:53 +0100
    Ready:          True
    Restart Count:  4
    Limits:
      cpu:     300m
      memory:  100Mi
    Requests:
      cpu:     300m
      memory:  100Mi
    Liveness:  http-get http://:livez/livez delay=5s timeout=1s period=5s #success=1 #failure=3
    Environment:
      POD_NAMESPACE:  dynatrace (v1:metadata.namespace)
    Mounts:
      /data from data-dir (rw)
      /tmp from tmp-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjtzj (ro)
  registrar:
    Container ID:  docker://6a19aa007e1a1131b6316186c2b5c135f17e7de57b8c5a537a3cb45f602cce7f
    Image:         docker.io/dynatrace/dynatrace-operator:v0.15.0
    Image ID:      docker-pullable://dynatrace/dynatrace-operator@sha256:0175f28a0176a998bd559adeef1cc69c07e876f4e6fdbc6b02f1cab71eca9179
    Port:          <none>
    Host Port:     <none>
    Command:
      csi-node-driver-registrar
    Args:
      --csi-address=/csi/csi.sock
      --kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)
    State:          Running
      Started:      Wed, 20 Dec 2023 10:40:21 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     20m
      memory:  30Mi
    Requests:
      cpu:     20m
      memory:  30Mi
    Environment:
      DRIVER_REG_SOCK_PATH:  /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com/csi.sock
    Mounts:
      /csi from plugin-dir (rw)
      /registration from registration-dir (rw)
      /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com/ from lockfile-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjtzj (ro)
  liveness-probe:
    Container ID:  docker://649e593cc20b790272082c99e52e51eb2a6b843f0db6e15e2a64567bd3cd9d5e
    Image:         docker.io/dynatrace/dynatrace-operator:v0.15.0
    Image ID:      docker-pullable://dynatrace/dynatrace-operator@sha256:0175f28a0176a998bd559adeef1cc69c07e876f4e6fdbc6b02f1cab71eca9179
    Port:          <none>
    Host Port:     <none>
    Command:
      livenessprobe
    Args:
      --csi-address=/csi/csi.sock
      --health-port=9898
    State:          Running
      Started:      Wed, 20 Dec 2023 10:40:23 +0100
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     20m
      memory:  30Mi
    Requests:
      cpu:        20m
      memory:     30Mi
    Environment:  <none>
    Mounts:
      /csi from plugin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjtzj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  registration-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins_registry/
    HostPathType:  Directory
  plugin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com/
    HostPathType:  DirectoryOrCreate
  data-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/plugins/csi.oneagent.dynatrace.com/data
    HostPathType:  DirectoryOrCreate
  mountpoint-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pods/
    HostPathType:  DirectoryOrCreate
  lockfile-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  tmp-dir:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-bjtzj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              <none>
Tolerations:                 ToBeDeletedByClusterAutoscaler:NoSchedule op=Exists
                             kubernetes.io/arch=arm64:NoSchedule
                             kubernetes.io/arch=amd64:NoSchedule
                             kubernetes.io/arch=ppc64le:NoSchedule
                             node-role.kubernetes.io/control-plane:NoSchedule op=Exists
                             node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  8m44s                  default-scheduler  Successfully assigned dynatrace/dynatrace-oneagent-csi-driver-5hgmz to k8s-worker-2
  Normal   Pulling    8m44s                  kubelet            Pulling image "docker.io/dynatrace/dynatrace-operator:v0.15.0"
  Normal   Pulled     8m28s                  kubelet            Successfully pulled image "docker.io/dynatrace/dynatrace-operator:v0.15.0" in 1.13s (16.028s including waiting)
  Normal   Created    8m28s                  kubelet            Created container csi-init
  Normal   Started    8m27s                  kubelet            Started container csi-init
  Normal   Pulling    8m26s                  kubelet            Pulling image "docker.io/dynatrace/dynatrace-operator:v0.15.0"
  Normal   Pulled     8m25s                  kubelet            Successfully pulled image "docker.io/dynatrace/dynatrace-operator:v0.15.0" in 1.079s (1.079s including waiting)
  Normal   Created    8m25s                  kubelet            Created container server
  Normal   Started    8m25s                  kubelet            Started container server
  Normal   Pulling    8m24s                  kubelet            Pulling image "docker.io/dynatrace/dynatrace-operator:v0.15.0"
  Normal   Pulled     8m24s                  kubelet            Successfully pulled image "docker.io/dynatrace/dynatrace-operator:v0.15.0" in 1.096s (1.096s including waiting)
  Normal   Created    8m23s                  kubelet            Created container registrar
  Normal   Pulled     8m23s                  kubelet            Successfully pulled image "docker.io/dynatrace/dynatrace-operator:v0.15.0" in 1.075s (1.075s including waiting)
  Normal   Pulling    8m22s                  kubelet            Pulling image "docker.io/dynatrace/dynatrace-operator:v0.15.0"
  Normal   Started    8m22s                  kubelet            Started container registrar
  Normal   Pulled     8m21s                  kubelet            Successfully pulled image "docker.io/dynatrace/dynatrace-operator:v0.15.0" in 1.056s (1.056s including waiting)
  Normal   Created    8m21s                  kubelet            Created container liveness-probe
  Normal   Started    8m20s                  kubelet            Started container liveness-probe
  Normal   Pulling    2m55s (x2 over 8m25s)  kubelet            Pulling image "docker.io/dynatrace/dynatrace-operator:v0.15.0"
  Normal   Created    2m39s (x2 over 8m24s)  kubelet            Created container provisioner
  Normal   Started    2m39s (x2 over 8m24s)  kubelet            Started container provisioner
  Normal   Pulled     2m39s                  kubelet            Successfully pulled image "docker.io/dynatrace/dynatrace-operator:v0.15.0" in 1.177s (16.231s including waiting)
  Warning  BackOff    2m9s (x2 over 2m11s)   kubelet            Back-off restarting failed container provisioner in pod dynatrace-oneagent-csi-driver-5hgmz_dynatrace(dfc3e2bf-2c4c-4627-a97d-18ed9222e907)
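The provisioner is pinned to a 100Mi limit (requests equal limits, Guaranteed QoS). As a stop-gap I could raise the limit on the DaemonSet with something like the patch below (256Mi is an arbitrary example value, and the change would likely be reverted whenever the manifests are re-applied), but I'd rather understand why 100Mi isn't enough here:

# kubectl -n dynatrace patch daemonset dynatrace-oneagent-csi-driver --type=strategic -p '{"spec":{"template":{"spec":{"containers":[{"name":"provisioner","resources":{"limits":{"memory":"256Mi"},"requests":{"memory":"256Mi"}}}]}}}}'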

Contributor

Thank you for opening a Dynatrace Operator Issue. We've identified and tagged the issue as a "Support request".

Dynatrace responds to requests like these via Dynatrace ONE support rather than GitHub. This helps our team respond as quickly as possible using the support team's tools and procedures.

Thanks for your help!
