All the flux pods are in CrashLoopBackOff, and no logs to the pods. #4703

ashok-busi · 2024-04-04T19:41:54Z

ashok-busi
Apr 4, 2024

All the flux pods are in CrashLoopBackOff state, and no logs to the pods on onprem kubernetes cluster.

flux version : 2.2.3
Kubernetes version : 1.28.6

NAME READY STATUS RESTARTS AGE
helm-controller-9b59fdc77-dqhkk 0/1 CrashLoopBackOff 12 (40s ago) 36m
image-automation-controller-7bd89767cc-vtg8n 0/1 CrashLoopBackOff 23 (3m25s ago) 96m
image-reflector-controller-6bbc86d5b9-5b2qp 0/1 CrashLoopBackOff 4 (18s ago) 119s
kustomize-controller-577b4ddbdd-pjm82 0/1 CrashLoopBackOff 3 (24s ago) 81s
notification-controller-797cc7d56d-5vj92 0/1 CrashLoopBackOff 23 (3m8s ago) 96m
source-controller-768b889f6d-drvw6 0/1 CrashLoopBackOff 23 (3m14s ago) 96m

couldn't find any error related to the flux.

Example for the pod events:

[root@uslxcp89634 ~]# kubectl describe pod image-automation-controller-59667c6b6c-j7k2g -n flux
Name: image-automation-controller-59667c6b6c-j7k2g
Namespace: flux
Priority: 0
Service Account: image-automation-controller
Node: uslxcp89686/10.74.21.26
Start Time: Thu, 04 Apr 2024 15:20:03 -0400
Labels: app=image-automation-controller
pod-template-hash=59667c6b6c
Annotations: kubectl.kubernetes.io/restartedAt: 2024-04-04T13:23:15-04:00
prometheus.io/port: 8080
prometheus.io/scrape: true
Status: Running
IP: 192.168.121.36
IPs:
IP: 192.168.121.36
Controlled By: ReplicaSet/image-automation-controller-59667c6b6c
Containers:
manager:
Container ID: containerd://2d69d32778bfc568f998e6092716876ea660e8622a8986f9461ee61ec0a7fe41
Image: ghcr.io/fluxcd/image-automation-controller:v0.37.1
Image ID: ghcr.io/fluxcd/image-automation-controller@sha256:97a8895cab8594af7509a5f2bc5495c03b7346722afc4e1e70bf6e445b7e575d
Ports: 8080/TCP, 9440/TCP
Host Ports: 0/TCP, 0/TCP
SeccompProfile: RuntimeDefault
Args:
--events-addr=http://notification-controller.flux.svc.cluster.local./
--watch-all-namespaces=true
--log-level=debug
--log-encoding=json
--enable-leader-election
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 132
Started: Thu, 04 Apr 2024 15:25:54 -0400
Finished: Thu, 04 Apr 2024 15:25:55 -0400
Ready: False
Restart Count: 6
Limits:
cpu: 1
memory: 1Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
RUNTIME_NAMESPACE: flux (v1:metadata.namespace)
Mounts:
/tmp from temp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-f8bkq (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
temp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit:
kube-api-access-f8bkq:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message

Normal Scheduled 7m5s default-scheduler Successfully assigned flux/image-automation-controller-59667c6b6c-j7k2g to uslxcp89686
Normal Created 6m21s (x4 over 7m5s) kubelet Created container manager
Normal Started 6m21s (x4 over 7m5s) kubelet Started container manager
Warning Unhealthy 6m20s (x2 over 6m42s) kubelet Readiness probe failed: Get "http://192.168.121.36:9440/readyz": dial tcp 192.168.121.36:9440: connect: connection refused
Normal Pulled 5m40s (x5 over 7m5s) kubelet Container image "ghcr.io/fluxcd/image-automation-controller:v0.37.1" already present on machine
Warning BackOff 112s (x30 over 7m4s) kubelet Back-off restarting failed container manager in pod image-automation-controller-59667c6b6c-j7k2g_flux(87f9ef2f-cfd9-4b7e-8934-caa8c0a8cf95)

-> Steps we followed while troubleshooting but none of them worked.

added additional resources to the cointainers.
Increased periodSeconds to liveness and readiness probes to see the initial errors, but couldn't find any.
Removed Liveness and readiness probes and deployed.
uninstalled flux completely and reinstalled , but still same issue.
restarted the nodes.

Working on this form past two days. nothing has been worked and its frustrated when we couldn't find any logs to troubleshoot.
Please let us know if anyone faced similar issue. any suggestions of advices will be helpful.

Thanks in advance.

stefanprodan · 2024-04-05T06:59:29Z

stefanprodan
Apr 5, 2024
Maintainer

This usually happens if the CNI is not working, you need to investigate the node and see what errors kubelet and the CNI daemon reports. Another option for baremetal, is to run Flux directly on the host network, see https://github.com/stefanprodan/flux-aio

0 replies

ashok-busi · 2024-04-05T17:56:16Z

ashok-busi
Apr 5, 2024
Author

Hi @stefanprodan , Thanks for the response.
we dont think its a CNI issue, as the other service pods are running without any issue.
but however our issue got resolved. we degraded flux version from 2.2.3 to 2.1.2. now the flux pods are up and running.

2 replies

macetw May 10, 2024

I saw this same problem. I am using a k3s cluster (a single node cluster on my laptop) and I, too, was able to recover from the problem by downgrading flux to version 2.1.2 like @ashok-busi did.

stefanprodan May 10, 2024
Maintainer

I've added k3s to our conformance test suite and the latest Flux version works fine, we're testing on k3s 1.28 and 1.29, see #4777

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

All the flux pods are in CrashLoopBackOff, and no logs to the pods. #4703

{{title}}

Replies: 2 comments 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

All the flux pods are in CrashLoopBackOff, and no logs to the pods. #4703

ashok-busi Apr 4, 2024

Replies: 2 comments · 2 replies

stefanprodan Apr 5, 2024 Maintainer

ashok-busi Apr 5, 2024 Author

macetw May 10, 2024

stefanprodan May 10, 2024 Maintainer

ashok-busi
Apr 4, 2024

Replies: 2 comments 2 replies

stefanprodan
Apr 5, 2024
Maintainer

ashok-busi
Apr 5, 2024
Author

stefanprodan May 10, 2024
Maintainer