[BUG] prometheus server status CrashLoopBackOff #1028

Closed
linghan-hub opened this issue Jan 17, 2023 · 9 comments · Fixed by #1440
@linghan-hub
Collaborator

Describe the bug
After enabling monitoring, the prometheus server pod is in CrashLoopBackOff status.

To Reproduce
Steps to reproduce the behavior:

  1. Install kubeblocks:
helm repo add kubeblocks https://apecloud.github.io/helm-charts
or (if the repo is already added):
helm repo update kubeblocks

helm upgrade --install kubeblocks kubeblocks/kubeblocks --version "0.3.0-beta.1"
  2. kubectl get pod
kubectl get pod
NAME                                   READY   STATUS             RESTARTS       AGE
kubeblocks-554fd6548b-mqh5r            1/1     Running            0              29m
kubeblocks-grafana-94c6975ff-ddg74     3/3     Running            0              29m
kubeblocks-prometheus-alertmanager-0   2/2     Running            0              29m
kubeblocks-prometheus-server-0         1/2     CrashLoopBackOff   10 (16s ago)   29m
  3. kubectl logs kubeblocks-prometheus-server-0
kubectl logs kubeblocks-prometheus-server-0
Defaulted container "prometheus-server-configmap-reload" out of: prometheus-server-configmap-reload, prometheus-server
2023/01/17 02:24:34 Watching directory: "/etc/config"
  4. kubectl logs kubeblocks-prometheus-server-0 -c prometheus-server
kubectl logs kubeblocks-prometheus-server-0 -c prometheus-server
ts=2023-01-17T02:51:54.989Z caller=main.go:543 level=info msg="Starting Prometheus Server" mode=server version="(version=2.39.1, branch=HEAD, revision=dcd6af9e0d56165c6f5c64ebbc1fae798d24933a)"
ts=2023-01-17T02:51:54.989Z caller=main.go:548 level=info build_context="(go=go1.19.2, user=root@273d60c69592, date=20221007-16:03:45)"
ts=2023-01-17T02:51:54.989Z caller=main.go:549 level=info host_details="(Linux 5.10.124-linuxkit #1 SMP PREEMPT Thu Jun 30 08:18:26 UTC 2022 aarch64 kubeblocks-prometheus-server-0 (none))"
ts=2023-01-17T02:51:54.989Z caller=main.go:550 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-01-17T02:51:54.989Z caller=main.go:551 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-01-17T02:51:54.989Z caller=query_logger.go:91 level=error component=activeQueryTracker msg="Error opening query log file" file=/data/queries.active err="open /data/queries.active: permission denied"
panic: Unable to create mmap-ed active query log

goroutine 1 [running]:
github.com/prometheus/prometheus/promql.NewActiveQueryTracker({0xfffff3fe37a6, 0x5}, 0x14, {0x34691c0, 0x40004bc640})
	/app/promql/query_logger.go:121 +0x2ec
main.main()
	/app/cmd/prometheus/main.go:605 +0x613c
  5. kubectl logs kubeblocks-prometheus-server-0 -c prometheus-server-configmap-reload
kubectl logs kubeblocks-prometheus-server-0 -c prometheus-server-configmap-reload
2023/01/17 02:24:34 Watching directory: "/etc/config"
@linghan-hub linghan-hub added the kind/bug Something isn't working label Jan 17, 2023
@linghan-hub linghan-hub added this to the Release 0.3.0 milestone Jan 17, 2023
@JashBook
Collaborator

JashBook commented Jan 17, 2023

It fails in minikube but succeeds in k3d.

kubectl describe pod kubeblocks-prometheus-server-0
 
Name:             kubeblocks-prometheus-server-0
Namespace:        default
Priority:         0
Service Account:  kubeblocks-prometheus-server
Node:             minikube-m02/192.168.76.3
Start Time:       Tue, 17 Jan 2023 13:34:44 +0800
Labels:           app=prometheus
                  chart=prometheus-15.16.1
                  component=server
                  controller-revision-hash=kubeblocks-prometheus-server-699bb9f7d9
                  heritage=Helm
                  release=kubeblocks
                  statefulset.kubernetes.io/pod-name=kubeblocks-prometheus-server-0
Annotations:      <none>
Status:           Running
IP:               10.244.1.8
IPs:
  IP:           10.244.1.8
Controlled By:  StatefulSet/kubeblocks-prometheus-server
Containers:
  prometheus-server-configmap-reload:
    Container ID:  docker://0a1dceecbef552c05278326d3d2b2cafe9228f0794e3b6cf4ef8b8130af086c6
    Image:         jimmidyson/configmap-reload:v0.5.0
    Image ID:      docker-pullable://jimmidyson/configmap-reload@sha256:904d08e9f701d3d8178cb61651dbe8edc5d08dd5895b56bdcac9e5805ea82b52
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://127.0.0.1:9090/-/reload
    State:          Running
      Started:      Tue, 17 Jan 2023 13:35:02 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/config from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q545s (ro)
  prometheus-server:
    Container ID:  docker://42ffa3f1d002fb20bfe6137f9fa65eea3fab9ecdf975a078bd3ce3ac31ba20c7
    Image:         quay.io/prometheus/prometheus:v2.39.1
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:4748e26f9369ee7270a7cd3fb9385c1adb441c05792ce2bce2f6dd622fd91d38
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 17 Jan 2023 14:12:20 +0800
      Finished:     Tue, 17 Jan 2023 14:12:20 +0800
    Ready:          False
    Restart Count:  12
    Liveness:       http-get http://:9090/-/healthy delay=30s timeout=10s period=15s #success=1 #failure=3
    Readiness:      http-get http://:9090/-/ready delay=30s timeout=4s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from storage-volume (rw)
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q545s (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  storage-volume-kubeblocks-prometheus-server-0
    ReadOnly:   false
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kubeblocks-prometheus-server
    Optional:  false
  kube-api-access-q545s:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                   Age                    From                                      Message
  ----     ------                   ----                   ----                                      -------
  Warning  FailedScheduling         38m                    default-scheduler                         0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled                38m                    default-scheduler                         Successfully assigned default/kubeblocks-prometheus-server-0 to minikube-m02
  Warning  VolumeConditionAbnormal  38m (x3 over 38m)      csi-pv-monitor-agent-hostpath.csi.k8s.io  The volume isn't mounted
  Normal   SuccessfulAttachVolume   38m                    attachdetach-controller                   AttachVolume.Attach succeeded for volume "pvc-3f8cb5f7-0d76-47c2-8d25-ca3a21014153"
  Warning  FailedMount              38m                    kubelet                                   MountVolume.SetUp failed for volume "config-volume" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling                  38m                    kubelet                                   Pulling image "jimmidyson/configmap-reload:v0.5.0"
  Normal   Started                  38m                    kubelet                                   Started container prometheus-server-configmap-reload
  Normal   Created                  38m                    kubelet                                   Created container prometheus-server-configmap-reload
  Normal   Pulled                   38m                    kubelet                                   Successfully pulled image "jimmidyson/configmap-reload:v0.5.0" in 9.604040254s
  Normal   Pulling                  38m                    kubelet                                   Pulling image "quay.io/prometheus/prometheus:v2.39.1"
  Normal   Pulled                   37m                    kubelet                                   Successfully pulled image "quay.io/prometheus/prometheus:v2.39.1" in 44.065166519s
  Normal   Started                  37m (x3 over 37m)      kubelet                                   Started container prometheus-server
  Normal   Created                  36m (x4 over 37m)      kubelet                                   Created container prometheus-server
  Normal   Pulled                   36m (x3 over 37m)      kubelet                                   Container image "quay.io/prometheus/prometheus:v2.39.1" already present on machine
  Normal   VolumeConditionNormal    3m24s (x136 over 37m)  csi-pv-monitor-agent-hostpath.csi.k8s.io  The Volume returns to the healthy state
  Warning  BackOff                  3m14s (x177 over 37m)  kubelet                                   Back-off restarting failed container

@yimeisun
Contributor

It succeeds on docker-desktop and EKS.

@yimeisun
Contributor

It succeeds in minikube when using the default standard StorageClass with the CSI minikube-hostpath provisioner, and fails in minikube when using the csi-hostpath-driver addon.
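
For anyone reproducing this, a quick way to confirm which StorageClass actually provisioned the prometheus volume (using the PVC name from the describe output in this thread) is:

kubectl get pvc storage-volume-kubeblocks-prometheus-server-0 -o jsonpath='{.spec.storageClassName}{"\n"}'
kubectl get storageclass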

@yimeisun
Contributor

Under normal circumstances, the PV created by a StorageClass has directory permission drwxrwxrwx, but the directory created by the csi-hostpath-driver in minikube has permission drwxr-xr-x, so the prometheus process cannot write to it at startup. After changing the directory permission from drwxr-xr-x to drwxrwxrwx, the prometheus process restarts successfully.
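
For reference, one way to apply that permission change from inside the cluster is a short-lived pod that mounts the same PVC and relaxes the permissions on its mount point. This is only a sketch of the idea (the pod name and busybox image are placeholders, and for a ReadWriteOnce hostpath volume the pod has to land on the same node as the prometheus pod), not necessarily the exact fix that was applied:

# fix-data-perms.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: fix-prometheus-data-perms
spec:
  restartPolicy: Never
  containers:
    - name: chmod
      image: busybox:1.35
      # open up the directory so the non-root prometheus user can write to it
      command: ["sh", "-c", "chmod 0777 /data && ls -ld /data"]
      volumeMounts:
        - name: storage-volume
          mountPath: /data
  volumes:
    - name: storage-volume
      persistentVolumeClaim:
        claimName: storage-volume-kubeblocks-prometheus-server-0

Apply it with kubectl apply -f fix-data-perms.yaml, wait for it to complete, then delete the pod; prometheus-server should come up on its next restart.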

@yimeisun
Contributor

yimeisun commented Jan 19, 2023

The backup tool also has this problem.

@yimeisun
Contributor

The CSI Hostpath Driver is only a demo and does not follow many best practices.
(screenshot attached: 20230119-140249)

@yimeisun yimeisun assigned dengshaojiang and unassigned yimeisun Jan 19, 2023
@JashBook
Collaborator

JashBook commented Jan 19, 2023

It also fails on EKS with KubeBlocks 0.3.0-beta.7:

kubectl get pod
NAME                                              READY   STATUS             RESTARTS       AGE
kubeblocks-6778fc5b68-72zsl                       1/1     Running            0              7m44s
kubeblocks-grafana-b765d544f-7cqxv                3/3     Running            0              7m44s
kubeblocks-prometheus-alertmanager-0              2/2     Running            0              7m43s
kubeblocks-prometheus-server-0                    1/2     CrashLoopBackOff   6 (101s ago)   7m43s
kubeblocks-snapshot-controller-6bf96bcbc8-cw6nf   1/1     Running            0              7m44s
kubectl describe pod kubeblocks-prometheus-server-0
Name:             kubeblocks-prometheus-server-0
Namespace:        default
Priority:         0
Service Account:  kubeblocks-prometheus-server
Node:             ip-172-31-45-53.cn-northwest-1.compute.internal/172.31.45.53
Start Time:       Thu, 19 Jan 2023 14:02:47 +0800
Labels:           app=prometheus
                  chart=prometheus-15.16.1
                  component=server
                  controller-revision-hash=kubeblocks-prometheus-server-699bb9f7d9
                  heritage=Helm
                  release=kubeblocks
                  statefulset.kubernetes.io/pod-name=kubeblocks-prometheus-server-0
Annotations:      kubernetes.io/psp: eks.privileged
Status:           Running
IP:               172.31.45.19
IPs:
  IP:           172.31.45.19
Controlled By:  StatefulSet/kubeblocks-prometheus-server
Containers:
  prometheus-server-configmap-reload:
    Container ID:  docker://17310ff781112b16fd5df2e9a2d38a6e2494cbaee5961cd15dc92d9d21ff7cf4
    Image:         jimmidyson/configmap-reload:v0.5.0
    Image ID:      docker-pullable://jimmidyson/configmap-reload@sha256:904d08e9f701d3d8178cb61651dbe8edc5d08dd5895b56bdcac9e5805ea82b52
    Port:          <none>
    Host Port:     <none>
    Args:
      --volume-dir=/etc/config
      --webhook-url=http://127.0.0.1:9090/-/reload
    State:          Running
      Started:      Thu, 19 Jan 2023 14:02:56 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etc/config from config-volume (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g65db (ro)
  prometheus-server:
    Container ID:  docker://c2787aa334c25c0b69a0179d8724272e198f3b8d2133eddf55e11f5b75bd5c15
    Image:         quay.io/prometheus/prometheus:v2.39.1
    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:4748e26f9369ee7270a7cd3fb9385c1adb441c05792ce2bce2f6dd622fd91d38
    Port:          9090/TCP
    Host Port:     0/TCP
    Args:
      --storage.tsdb.retention.time=15d
      --config.file=/etc/config/prometheus.yml
      --storage.tsdb.path=/data
      --web.console.libraries=/etc/prometheus/console_libraries
      --web.console.templates=/etc/prometheus/consoles
      --web.enable-lifecycle
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 19 Jan 2023 14:08:49 +0800
      Finished:     Thu, 19 Jan 2023 14:08:49 +0800
    Ready:          False
    Restart Count:  6
    Liveness:       http-get http://:9090/-/healthy delay=30s timeout=10s period=15s #success=1 #failure=3
    Readiness:      http-get http://:9090/-/ready delay=30s timeout=4s period=5s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /data from storage-volume (rw)
      /etc/config from config-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-g65db (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  storage-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  storage-volume-kubeblocks-prometheus-server-0
    ReadOnly:   false
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kubeblocks-prometheus-server
    Optional:  false
  kube-api-access-g65db:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age                   From                     Message
  ----     ------                  ----                  ----                     -------
  Normal   Scheduled               7m13s                 default-scheduler        Successfully assigned default/kubeblocks-prometheus-server-0 to ip-172-31-45-53.cn-northwest-1.compute.internal
  Normal   SuccessfulAttachVolume  7m11s                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-a80c037e-1953-4f20-b91b-17d433828274"
  Normal   Pulled                  7m4s                  kubelet                  Container image "jimmidyson/configmap-reload:v0.5.0" already present on machine
  Normal   Created                 7m4s                  kubelet                  Created container prometheus-server-configmap-reload
  Normal   Started                 7m4s                  kubelet                  Started container prometheus-server-configmap-reload
  Normal   Pulled                  6m19s (x4 over 7m4s)  kubelet                  Container image "quay.io/prometheus/prometheus:v2.39.1" already present on machine
  Normal   Created                 6m19s (x4 over 7m4s)  kubelet                  Created container prometheus-server
  Normal   Started                 6m19s (x4 over 7m4s)  kubelet                  Started container prometheus-server
  Warning  BackOff                 2m (x31 over 7m3s)    kubelet                  Back-off restarting failed container
kubectl logs kubeblocks-prometheus-server-0 prometheus-server
ts=2023-01-19T06:08:49.545Z caller=main.go:543 level=info msg="Starting Prometheus Server" mode=server version="(version=2.39.1, branch=HEAD, revision=dcd6af9e0d56165c6f5c64ebbc1fae798d24933a)"
ts=2023-01-19T06:08:49.545Z caller=main.go:548 level=info build_context="(go=go1.19.2, user=root@273d60c69592, date=20221007-16:03:45)"
ts=2023-01-19T06:08:49.545Z caller=main.go:549 level=info host_details="(Linux 5.4.219-126.411.amzn2.aarch64 #1 SMP Wed Nov 2 17:44:17 UTC 2022 aarch64 kubeblocks-prometheus-server-0 (none))"
ts=2023-01-19T06:08:49.545Z caller=main.go:550 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2023-01-19T06:08:49.545Z caller=main.go:551 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2023-01-19T06:08:49.548Z caller=web.go:559 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2023-01-19T06:08:49.548Z caller=main.go:980 level=info msg="Starting TSDB ..."
ts=2023-01-19T06:08:49.549Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1674032399493 maxt=1674036000000 ulid=01GQ2C086PHAZ3R0S4VFFDS0N6
ts=2023-01-19T06:08:49.549Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1674036000004 maxt=1674043200000 ulid=01GQ2FE4993VMGDCBPHNY76CES
ts=2023-01-19T06:08:49.549Z caller=tls_config.go:195 level=info component=web msg="TLS is disabled." http2=false
ts=2023-01-19T06:08:49.550Z caller=main.go:839 level=info msg="Stopping scrape discovery manager..."
ts=2023-01-19T06:08:49.550Z caller=main.go:853 level=info msg="Stopping notify discovery manager..."
ts=2023-01-19T06:08:49.550Z caller=manager.go:957 level=info component="rule manager" msg="Stopping rule manager..."
ts=2023-01-19T06:08:49.550Z caller=manager.go:967 level=info component="rule manager" msg="Rule manager stopped"
ts=2023-01-19T06:08:49.550Z caller=main.go:890 level=info msg="Stopping scrape manager..."
ts=2023-01-19T06:08:49.550Z caller=main.go:849 level=info msg="Notify discovery manager stopped"
ts=2023-01-19T06:08:49.550Z caller=main.go:835 level=info msg="Scrape discovery manager stopped"
ts=2023-01-19T06:08:49.550Z caller=main.go:882 level=info msg="Scrape manager stopped"
ts=2023-01-19T06:08:49.550Z caller=manager.go:943 level=info component="rule manager" msg="Starting rule manager..."
ts=2023-01-19T06:08:49.550Z caller=notifier.go:608 level=info component=notifier msg="Stopping notification manager..."
ts=2023-01-19T06:08:49.550Z caller=main.go:1110 level=info msg="Notifier manager stopped"
ts=2023-01-19T06:08:49.550Z caller=main.go:1119 level=error err="opening storage failed: lock DB directory: open /data/lock: no space left on device"

@yimeisun
Contributor

yimeisun commented Jan 19, 2023

@JashBook This is a different problem; see the log below:
ts=2023-01-19T06:08:49.550Z caller=main.go:1119 level=error err="opening storage failed: lock DB directory: open /data/lock: no space left on device"

The default PV size is 1Gi, with a data retention of 15d. This covers only the minimal requirement; users should enlarge it for production.

The PVs created for prometheus and alertmanager are reused after a reinstall so that data is not lost. If you do not need the old data, delete the PV manually after uninstalling.
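
For example, to drop the old volume after uninstalling (the PVC name below is the one from the describe output in this thread; note that this deletes all retained metrics):

kubectl delete pvc storage-volume-kubeblocks-prometheus-server-0
kubectl get pv
kubectl delete pv <pv-name>   # only needed if the released PV's reclaim policy is Retain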

@ahjing99 ahjing99 modified the milestones: Release 0.3.0, Release 0.4.0 Jan 19, 2023
@dengshaojiang
Contributor

dengshaojiang commented Jan 19, 2023

A hostPath volume mounts a directory from the host node's filesystem into your Pod and keeps the same group and ownership as the kubelet.
The security context of mounted volumes depends on the implementation of the CSI driver (see the references below), so setting securityContext.fsGroup may NOT take effect.
You need to initialize the mounted directory before the application starts, or use an init container to initialize the permissions of the mount directory (a sketch follows the references).

References:
https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#delegating-volume-permission-and-ownership-change-to-csi-driver
https://kubernetes-csi.github.io/docs/support-fsgroup.html#overview-1
kubernetes/kubernetes#2630
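
As an illustration of the init-container approach (only a sketch: the image, user/group IDs and permission bits are assumptions, and the chart may expose a cleaner way to configure this), something like the following could be added to the prometheus-server pod spec to initialize the /data mount before the main container starts:

initContainers:
  - name: init-data-permissions
    image: busybox:1.35          # placeholder image
    # the official prometheus image runs as nobody (65534); hand the data dir to that user
    command: ["sh", "-c", "chown -R 65534:65534 /data && chmod -R 0775 /data"]
    securityContext:
      runAsUser: 0               # root is needed to change ownership on the host-backed directory
    volumeMounts:
      - name: storage-volume     # the same volume prometheus-server mounts at /data
        mountPath: /data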
