[No space left on device Error] koordlet is throwing No space left #2028

Closed
kavita1205 opened this issue May 1, 2024 · 2 comments
Labels
kind/bug Create a report to help us improve
kind/question Support request or question relating to Koordinator

Comments

@kavita1205

What happened:
We installed Koordinator using the Helm chart described in the documentation (https://koordinator.sh/docs/installation/), and the koordlet pods are going into CrashLoopBackOff. Can someone please help me fix this issue?

NAME                                 READY   STATUS             RESTARTS           AGE
koord-descheduler-665566d89d-k6ndg   1/1     Running            12 (19h ago)       8d
koord-descheduler-665566d89d-lrwlq   1/1     Running            10 (34h ago)       8d
koord-manager-889449d7b-f8sjp        1/1     Running            8 (19h ago)        8d
koord-manager-889449d7b-mbx54        1/1     Running            5 (12h ago)        8d
koord-manager-889449d7b-w446g        1/1     Running            6 (34h ago)        8d
koord-manager-889449d7b-zdz5s        1/1     Running            4 (3d8h ago)       8d
koord-manager-889449d7b-zw5fg        1/1     Running            5 (3d7h ago)       8d
koord-scheduler-78c4c46748-jpvn9     1/1     Running            3 (3d8h ago)       8d
koord-scheduler-78c4c46748-krlw9     1/1     Running            1 (19h ago)        8d
koord-scheduler-78c4c46748-m4xx6     1/1     Running            3 (34h ago)        8d
koord-scheduler-78c4c46748-pztdr     1/1     Running            5 (34h ago)        8d
koord-scheduler-78c4c46748-x8lxs     0/1     CrashLoopBackOff   1073 (43s ago)     8d
koordlet-49btm                       1/1     Running            0                  8d
koordlet-4klvq                       1/1     Running            0                  8d
koordlet-4zfsq                       1/1     Running            0                  8d
koordlet-5bmms                       1/1     Running            0                  8d
koordlet-72pk7                       1/1     Running            0                  8d
koordlet-7blrf                       1/1     Running            1 (7d13h ago)      8d
koordlet-lf8tx                       0/1     CrashLoopBackOff   2097 (2m58s ago)   8d
koordlet-s4dvl                       0/1     CrashLoopBackOff   2222 (2m1s ago)    8d
koordlet-xgdn4                       1/1     Running            1 (7d22h ago)      8d
koordlet-xpbcd                       1/1     Running            0                  8d
koordlet-xwxsc                       1/1     Running            0                  8d
koordlet-zrpt6                       1/1     Running            0                  8d

When I checked the logs, I found the error below:

kubectl logs -n koordinator-system koordlet-lf8tx
I0501 05:30:27.393307 3414091 cgroup_driver.go:212] Node lv01-mlkfwapp-l03 use 'systemd' as cgroup driver guessed with the cgroup name
I0501 05:30:27.421618 3414091 feature_gate.go:245] feature gates: &{map[Accelerators:true BECPUEvict:true BEMemoryEvict:true CgroupReconcile:true]}
I0501 05:30:27.421756 3414091 main.go:70] Setting up kubeconfig for koordlet
I0501 05:30:27.421963 3414091 koordlet.go:76] NODE_NAME is lv01-mlkfwapp-l03, start time 1.714566627e+09
I0501 05:30:27.437494 3414091 version.go:45] [/host-cgroup/cpu/cpu.bvt_warp_ns] PathExists exists false, err: <nil>
I0501 05:30:27.438099 3414091 version.go:52] [/host-cgroup/memory/*/memory.wmark_ratio] PathExists wmark_ratio exists [], err: <nil>
I0501 05:30:27.438237 3414091 resctrl.go:74] isResctrlAvailableByCpuInfo result, isCatFlagSet: false, isMbaFlagSet: false
I0501 05:30:27.438438 3414091 resctrl.go:89] isResctrlAvailableByKernelCmd result, isCatFlagSet: false, isMbaFlagSet: false
I0501 05:30:27.438447 3414091 resctrl.go:106] IsSupportResctrl result, cpuSupport: false, kernelSupport: false
I0501 05:30:27.438462 3414091 config.go:73] resctrl supported: false
I0501 05:30:27.438475 3414091 koordlet.go:80] sysconf: &{CgroupRootDir:/host-cgroup/ CgroupKubePath:kubepods/ SysRootDir:/host-sys/ SysFSRootDir:/host-sys-fs/ ProcRootDir:/proc/ VarRunRootDir:/host-var-run/ RunRootDir:/host-run/ RuntimeHooksConfigDir:/host-etc-hookserver/ ContainerdEndPoint: PouchEndpoint: DockerEndPoint: DefaultRuntimeType:containerd}, agentMode: dsMode
I0501 05:30:27.438525 3414091 koordlet.go:81] kernel version INFO: {IsAnolisOS:false}
panic: preallocate: no space left on device

goroutine 1275 [running]:
github.com/prometheus/prometheus/tsdb.handleChunkWriteError({0x2943760?, 0xc001bd4720?})
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:893 +0x76
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk(0xc000bd01e0, 0x41afe7?, 0x28?, 0x421545?, {0x2965af8, 0xc0004800e0}, 0x2598c90)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/chunks/head_chunks.go:418 +0x151
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc0009144e0, 0x63e1d492?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:882 +0x53
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc0009144e0, 0x18f0d425afa, 0x3f3091e7955b9f9a?, 0x1b7740)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:826 +0x2f
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc0009144e0, 0x18f0d425afa, 0x0, 0x0, 0x0?, 0x1b7740)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:797 +0x1ea
github.com/prometheus/prometheus/tsdb.(*walSubsetProcessor).processWALSamples(0xc0009aafb0, 0xc00093a480, 0x4165240000000000?, 0x4a?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:463 +0x3f0
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func7(0x29?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:110 +0x45
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:109 +0x414
kubectl logs -n koordinator-system koordlet-lf8tx
I0501 05:41:01.434444 3428678 cgroup_driver.go:212] Node lv01-mlkfwapp-l03 use 'systemd' as cgroup driver guessed with the cgroup name
I0501 05:41:01.469340 3428678 feature_gate.go:245] feature gates: &{map[Accelerators:true BECPUEvict:true BEMemoryEvict:true CgroupReconcile:true]}
I0501 05:41:01.469547 3428678 main.go:70] Setting up kubeconfig for koordlet
I0501 05:41:01.469833 3428678 koordlet.go:76] NODE_NAME is lv01-mlkfwapp-l03, start time 1.714567261e+09
I0501 05:41:01.472657 3428678 version.go:45] [/host-cgroup/cpu/cpu.bvt_warp_ns] PathExists exists false, err: <nil>
I0501 05:41:01.473513 3428678 version.go:52] [/host-cgroup/memory/*/memory.wmark_ratio] PathExists wmark_ratio exists [], err: <nil>
I0501 05:41:01.473625 3428678 resctrl.go:74] isResctrlAvailableByCpuInfo result, isCatFlagSet: false, isMbaFlagSet: false
I0501 05:41:01.473790 3428678 resctrl.go:89] isResctrlAvailableByKernelCmd result, isCatFlagSet: false, isMbaFlagSet: false
I0501 05:41:01.473798 3428678 resctrl.go:106] IsSupportResctrl result, cpuSupport: false, kernelSupport: false
I0501 05:41:01.473810 3428678 config.go:73] resctrl supported: false
I0501 05:41:01.473819 3428678 koordlet.go:80] sysconf: &{CgroupRootDir:/host-cgroup/ CgroupKubePath:kubepods/ SysRootDir:/host-sys/ SysFSRootDir:/host-sys-fs/ ProcRootDir:/proc/ VarRunRootDir:/host-var-run/ RunRootDir:/host-run/ RuntimeHooksConfigDir:/host-etc-hookserver/ ContainerdEndPoint: PouchEndpoint: DockerEndPoint: DefaultRuntimeType:containerd}, agentMode: dsMode
I0501 05:41:01.473865 3428678 koordlet.go:81] kernel version INFO: {IsAnolisOS:false}
panic: preallocate: no space left on device

goroutine 1248 [running]:
github.com/prometheus/prometheus/tsdb.handleChunkWriteError({0x2943760?, 0xc001a0e150?})
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:893 +0x76
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk(0xc00095a2d0, 0x41afe7?, 0x28?, 0x22e89a0?, {0x2965af8, 0xc00094e660}, 0x2598c90)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/chunks/head_chunks.go:418 +0x151
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk(0xc000a285b0, 0x63e1d492?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:882 +0x53
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk(0xc000a285b0, 0x18f0d425ca2, 0x0?, 0x1b7740)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:826 +0x2f
github.com/prometheus/prometheus/tsdb.(*memSeries).append(0xc000a285b0, 0x18f0d425ca2, 0x0, 0x0, 0x179d866?, 0x1b7740)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_append.go:797 +0x1ea
github.com/prometheus/prometheus/tsdb.(*walSubsetProcessor).processWALSamples(0xc0017fa400, 0xc0009fe000, 0x0?, 0x0?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:463 +0x3f0
github.com/prometheus/prometheus/tsdb.(*Head).loadWAL.func7(0x0?)
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:110 +0x45
created by github.com/prometheus/prometheus/tsdb.(*Head).loadWAL
        /go/pkg/mod/github.com/prometheus/prometheus@v0.39.2/tsdb/head_wal.go:109 +0x414
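The panic comes from the embedded Prometheus TSDB that koordlet uses for local metrics: preallocating space for a new head chunk fails with "no space left on device" while replaying the WAL. A minimal way to confirm where space is running out (the pod and node names come from the listing and logs above; the directories checked are assumptions taken from the hostDirs values further down):

# find which node the crashing koordlet pod runs on
kubectl -n koordinator-system get pod koordlet-lf8tx -o wide

# on that node, check free space and free inodes on the root filesystem,
# plus the koordlet host directories from the chart values
df -h /
df -i /
df -h /var/run/koordlet /etc/runtime/hookserver.d 2>/dev/null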

Below is the values.yaml used for the installation:

# Default values for koordinator.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

crds:
  managed: true

# values for koordinator installation
installation:
  namespace: koordinator-system
  roleListGroups:
    - '*'

featureGates: ""

imageRepositoryHost: ghcr.io

koordlet:
  image:
    repository: koordinator-sh/koordlet
    tag: "v1.4.1"
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 512Mi
  features: ""
  log:
    # log level for koordlet
    level: "4"
  hostDirs:
    kubeletConfigDir: /etc/kubernetes/
    kubeletLibDir: /var/lib/kubelet/
    koordProxyRegisterDir: /etc/runtime/hookserver.d/
    koordletSockDir: /var/run/koordlet
    predictionCheckpointDir: /var/run/koordlet/prediction-checkpoints
    # if not specified, use tmpfs by default
    koordletTSDBDir: ""
  enableServiceMonitor: false


manager:
  # settings for log print
  log:
    # log level for koord-manager
    level: "4"

  replicas: 5
  image:
    repository: koordinator-sh/koord-manager
    tag: "v1.4.1"
  webhook:
    port: 9876
  metrics:
    port: 8080
  healthProbe:
    port: 8000

  resyncPeriod: "0"

  # resources of koord-manager container
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

  hostNetwork: false

  nodeAffinity: {}
  nodeSelector: 
    node-role.kubernetes.io/control-plane: ""
  tolerations: 
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      effect: "NoSchedule"

webhookConfiguration:
  failurePolicy:
    pods: Ignore
    elasticquotas: Ignore
    nodeStatus: Ignore
    nodes: Ignore
  timeoutSeconds: 30

serviceAccount:
  annotations: {}


scheduler:
  # settings for log print
  log:
    # log level for koord-scheduler
    level: "4"

  replicas: 5
  image:
    repository: koordinator-sh/koord-scheduler
    tag: "v1.4.1"
  port: 10251

  # feature-gates for k8s > 1.22
  featureGates: ""
  # feature-gates for k8s 1.22
  compatible122FeatureGates: "CompatibleCSIStorageCapacity=true"
  # feature-gates for k8s < 1.22
  compatibleBelow122FeatureGates: "DisableCSIStorageCapacityInformer=true,CompatiblePodDisruptionBudget=true"

  # resources of koord-scheduler container
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

  hostNetwork: false

  nodeAffinity: {}
  nodeSelector: 
    node-role.kubernetes.io/control-plane: ""
  tolerations: 
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      effect: "NoSchedule"

descheduler:
  # settings for log print
  log:
    # log level for koord-descheduler
    level: "4"

  replicas: 2
  image:
    repository: koordinator-sh/koord-descheduler
    tag: "v1.4.1"
  port: 10251

  featureGates: ""

  # resources of koord-descheduler container
  resources:
    limits:
      cpu: 1000m
      memory: 1Gi
    requests:
      cpu: 500m
      memory: 256Mi

  hostNetwork: false

  nodeAffinity: {}
  nodeSelector: 
    node-role.kubernetes.io/control-plane: ""
  tolerations: 
    - key: "node-role.kubernetes.io/master"
      operator: "Equal"
      effect: "NoSchedule"

What you expected to happen:
All pods should run properly without error.
How to reproduce it (as minimally and precisely as possible):
You can use the values.yaml above to reproduce this issue.
Anything else we need to know?:

Environment:

  • App version: 1.4.1
  • Kubernetes version (use kubectl version): 1.24
  • Install details (e.g. helm install args): helm install koordinator koordinator-sh/koordinator --version 1.4.1 -f values.yaml
  • Node environment (for koordlet/runtime-proxy issue):
    • Containerd/Docker version:
    • OS version: Ubuntu 22.04
    • Kernel version:
    • Cgroup driver: cgroupfs/systemd
  • Others:
@kavita1205 kavita1205 added the kind/bug Create a report to help us improve label May 1, 2024
@saintube
Member

saintube commented May 6, 2024

@kavita1205 It seems some koordlet pods panic while the others work well. Please check whether the nodes where the koordlet pods panic have enough memory.
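The default TSDB location matters here: per the values.yaml above, koordletTSDBDir falls back to tmpfs when left empty, and tmpfs is memory-backed, so "no space left on device" there can surface under memory pressure rather than a full disk. A rough check on the affected node (the node name is taken from the koordlet logs above; kubectl top requires metrics-server to be installed):

kubectl describe node lv01-mlkfwapp-l03 | grep -A 8 "Allocated resources"
kubectl top node lv01-mlkfwapp-l03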

@saintube saintube added the kind/question Support request or question relating to Koordinator label May 7, 2024
@kavita1205
Author

@saintube This issue is fixed by increasing the disk space on the node. Thank you very much.
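Besides freeing or adding disk on the node, the chart also lets you point the koordlet TSDB at a host directory with more headroom instead of the tmpfs default. A hedged sketch (the path /data/koordlet/tsdb is a placeholder, and this assumes koordlet.hostDirs.koordletTSDBDir is wired through to the koordlet DaemonSet as the inline comment in values.yaml suggests):

helm upgrade koordinator koordinator-sh/koordinator --version 1.4.1 \
  --reuse-values \
  --set koordlet.hostDirs.koordletTSDBDir=/data/koordlet/tsdb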
