
OOMKill for period of time after which it magically resolves without changes #1110

Open
kalmarz opened this issue May 31, 2023 · 29 comments
Labels: area/git (Git related issues and pull requests), blocked/needs-validation (Requires wider review and validation)

Comments

kalmarz commented May 31, 2023

source-controller just got OOMKilled out of the blue. It started about an hour ago and I can't see any prior activity that triggered it. No recently added sources, nothing. It started with rc3; I upgraded to rc4, but the behaviour is the same. Memory usage goes through the roof and the cluster kills the pod.

❯❯❯ flux stats 
RECONCILERS             RUNNING FAILING SUSPENDED       STORAGE
GitRepository           7       0       0               1.9 MiB
OCIRepository           0       0       0               -
HelmRepository          0       0       0               -
HelmChart               0       0       0               -
Bucket                  0       0       0               -
Kustomization           3       0       0               -
HelmRelease             0       0       0               -
Alert                   0       0       0               -
Provider                0       0       0               -
Receiver                0       0       0               -
ImageUpdateAutomation   6       0       0               -
ImagePolicy             20      0       0               -
ImageRepository         20      0       0               -
❯❯❯ flux check
► checking prerequisites
✔ Kubernetes 1.22.17-eks-0a21954 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.34.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.34.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.28.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.0.0-rc.4
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.0.0-rc.4

flux check gets stuck at this point because source-controller is not responding.
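The OOMKill itself can be confirmed from the pod status before digging further. A minimal sketch, assuming the default flux-system namespace and the standard app=source-controller label:

```shell
# Show restart counts for the source-controller pod(s)
kubectl -n flux-system get pods -l app=source-controller

# The previous container's termination reason should read OOMKilled (exit code 137)
kubectl -n flux-system describe pod -l app=source-controller | grep -A 5 "Last State"
```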

stefanprodan added the bug (Something isn't working) and area/git (Git related issues and pull requests) labels on May 31, 2023
hiddeco (Member) commented May 31, 2023

Does the time it takes to end up in an OOM kill allow you to collect a heap profile? Instructions can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
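For convenience, a sketch of the profile collection described in that guide, assuming the pprof endpoints are served on the controller's default metrics port (8080) in the flux-system namespace; if your deployment overrides --metrics-addr, adjust the port and follow the linked guide instead:

```shell
# Forward the controller's metrics/pprof port locally
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &

# Take a heap snapshot while memory usage is climbing
curl -Sk http://localhost:8080/debug/pprof/heap > heap.out

# Inspect the snapshot locally with the Go tooling
go tool pprof -top heap.out
```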

kalmarz (Author) commented May 31, 2023

heap.out.zip

I tried to catch it right before the kill.

cwrau commented May 31, 2023

Same thing is happening with some of our clusters as well

hiddeco (Member) commented May 31, 2023

@cwrau did this start with RC.3 as well?

In addition, did you run RC.2 or RC.1 before without issues?

Based on the heap profile shared, I can't tell what is happening, as it appears to have been taken before the actual issue occurs. What may help is temporarily increasing the memory limits so you can take a proper snapshot while the problem happens, without the Pod getting killed.
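One way to temporarily raise the limit in place so a snapshot can be taken before the kill (a sketch: "manager" is the Flux default container name and the 2Gi value is illustrative; note that the flux-system Kustomization may revert a manual patch on its next reconciliation, in which case the change should instead go into your bootstrap kustomization as a patch):

```shell
# Bump source-controller's memory limit in place
kubectl -n flux-system patch deployment source-controller --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
```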

cwrau commented May 31, 2023

> @cwrau did this start with RC.3 as well?
>
> In addition, did you run RC.2 or RC.1 before without issues?
>
> Based on the heap profile shared, I can't tell what is happening, as it appears to have been taken before the actual issue occurs. What may help is temporarily increasing the memory limits so you can take a proper snapshot while the problem happens, without the Pod getting killed.

No, this seems to be kinda version-universal; we have old clusters with source-controller 0.18.0 which also got OOM'd.

hiddeco (Member) commented May 31, 2023

Is there any chance you are all making use of BitBucket and things have now returned to normal?

stefanprodan changed the title from "source-controller gets OOMKilled" to "BitBucket: source-controller gets OOMKilled" on May 31, 2023
cwrau commented May 31, 2023

> Is there any chance you are all making use of BitBucket and things have now returned to normal?

No BitBucket for us, as far as I can see.

HelmRepositories:

https://charts.jetstack.io
https://kubernetes.github.io/ingress-nginx
https://kyverno.github.io/kyverno/
https://charts.bitnami.com/bitnami
https://helm-charts.bitpoke.io
https://opensource.zalando.com/postgres-operator/charts/postgres-operator/
https://prometheus-community.github.io/helm-charts
https://vmware-tanzu.github.io/helm-charts/
https://grafana.github.io/helm-charts
https://aquasecurity.github.io/helm-charts/

GitRepositories:

ssh://git@gitlab.REDACTED
ssh://git@gitlab.REDACTED

(self-hosted GitLab)

But yes, it seems to have stopped

hiddeco (Member) commented May 31, 2023

Any chance your crashing instances are hosted on AWS EKS (AMD64, EC2, managed node groups)? I am trying to find a correlation here between your setups, as the issue seems to have started for a group of users at the same time, with a range of controller versions, and then magically stopped.

ghost commented May 31, 2023

Hello @hiddeco,

I'm with the same organization as @cwrau. No, we don't use AWS; we run our own infrastructure based on OpenStack. We are also trying to find correlations, especially since this came and went out of nowhere.

Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.

hiddeco changed the title from "BitBucket: source-controller gets OOMKilled" to "OOMKill for period of time after which it magically resolves without changes" on May 31, 2023
kalmarz (Author) commented May 31, 2023

I see a bump in DNS traffic for this period.

[image: DNS traffic graph]

hiddeco (Member) commented May 31, 2023

Without this happening again, and a proper heap snapshot taken while it happens, I fear this will be very much like looking for a needle in a haystack.

cwrau commented May 31, 2023

> Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.

The clusters I know of just don't have a memory limit 😉

hiddeco added the blocked/needs-validation (Requires wider review and validation) label and removed the bug (Something isn't working) label on May 31, 2023
ghost commented May 31, 2023

> > Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.
>
> The clusters I know of just don't have a memory limit 😉

Oh, you're right. They all don't. Oops!

cwrau commented May 31, 2023

> I see a bump in DNS traffic for this period.
>
> [image: DNS traffic graph]

The same thing happened on at least one of our clusters:

https://snapshots.raintank.io/dashboard/snapshot/6QEyFh33cs2tH6FYj8enOeY0rlbY7jvc

hiddeco (Member) commented May 31, 2023

Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects?

cwrau commented May 31, 2023

> Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects?

Yeah, I was thinking the same thing, but in my snapshot you can see that the responses take longer and that the request and response sizes also got bigger. Maybe that has something to do with this?

hiddeco (Member) commented May 31, 2023

It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery.

Any chance you collect logs somewhere that survive pod restarts?
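If no external log aggregation is in place, a minimal fallback right after a restart is pulling the previous container's logs and the namespace events (a sketch, assuming the default flux-system namespace):

```shell
# Logs from the container instance that was OOMKilled (available until the pod restarts again or is rescheduled)
kubectl -n flux-system logs deploy/source-controller --previous

# Recent namespace events, which include OOMKilled/BackOff reasons
kubectl -n flux-system get events --sort-by=.lastTimestamp
```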

cwrau commented May 31, 2023

> It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery.
>
> Any chance you collect logs somewhere that survive pod restarts?

Of course: https://snapshots.raintank.io/dashboard/snapshot/o6vLv6rjgym3qTdZs2AH6c0m3CjQrIyl

Better dashboard: https://snapshots.raintank.io/dashboard/snapshot/WinIZkonEwYwQaYoh8fQ1NtW2NK2yWxp

hiddeco (Member) commented May 31, 2023

Based on the following log lines (and others like them), it seems that the whole cluster network was generally unstable at the time.

{"level":"error","ts":"2023-05-31T11:57:14.737Z","logger":"runtime","msg":"Failed to release lock: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:14.706Z","logger":"runtime","msg":"failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition\n"}
{"level":"info","ts":"2023-05-31T11:57:14.703Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta2.OCIRepository: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta2.OCIRepository: Unauthorized\n"}	
{"level":"error","ts":"2023-05-31T11:57:12.788Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta2.Bucket: failed to list *v1beta2.Bucket: Unauthorized\n"}	
{"level":"error","ts":"2023-05-31T11:57:10.671Z","logger":"runtime","msg":"Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"grafana.17643873d2342f01\", GenerateName:\"\", Namespace:\"flux-system\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"HelmRepository\", Namespace:\"flux-system\", Name:\"grafana\", UID:\"8d83e45a-9757-4b0d-8452-53a4d9404aa1\", APIVersion:\"[source.toolkit.fluxcd.io/v1beta2](http://source.toolkit.fluxcd.io/v1beta2)\", ResourceVersion:\"62460920\", FieldPath:\"\"}, Reason:\"GarbageCollectionSucceeded\", Message:\"garbage collected artifacts for deleted resource\", Source:v1.EventSource{Component:\"source-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), LastTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'events \"grafana.17643873d2342f01\" is forbidden: unable to create new content in namespace flux-system because it is being terminated' (will not retry!)\n"}

This by itself does not explain the OOMKill, but it may lead to pointers elsewhere (CoreDNS, the Kubernetes API server?) that explain how we can replicate this.
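For anyone following up on those pointers, a quick sanity check of both from the affected cluster (a sketch; the k8s-app=kube-dns selector is the common CoreDNS label, but some distributions label it differently):

```shell
# Recent CoreDNS logs
kubectl -n kube-system logs -l k8s-app=kube-dns --since=1h --tail=200

# API server readiness, including individual check results
kubectl get --raw='/readyz?verbose'
```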

cwrau commented May 31, 2023

Hm, it just occurred to me that that cluster might not be the best example, so let's instead take a look at this one:

CoreDNS: https://snapshots.raintank.io/dashboard/snapshot/v7J2X7eC7xHsSXGRdRjJ3pXiTe6kja4r
Logs: https://snapshots.raintank.io/dashboard/snapshot/eTRez5Xo65uZEaMwtoJ1Qh7MqwvSBUyv
Resources: https://snapshots.raintank.io/dashboard/snapshot/ZacWHodjq5khruJB8V4kVxxJA7lnll3w

Sorry 😅

haarchri commented

We see the same on our 15+ clusters in AWS EKS (Karpenter nodes, OOMKilled with exit code 137). We are running the latest version before the RC releases.

What would help to find the problem? A heap dump?

akesser commented Jun 27, 2023

Here are two heap profiles of the same pod, taken about 3h30min apart:
heap.zip

Memory consumption increased from about 290 MB to 2240 MB over that time.
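Since both profiles come from the same process, pprof can diff them directly, which makes the growth between the snapshots easy to attribute (a sketch; the file names stand in for the earlier and later profile in the archive):

```shell
# Show which allocation sites grew between the two snapshots
go tool pprof -top -base heap-earlier.out heap-later.out

# Or browse the diff interactively in a local web UI
go tool pprof -http=:8081 -base heap-earlier.out heap-later.out
```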

akesser commented Jun 28, 2023

Here is a heap profile from the fiftieth incarnation of the pod in a crash loop, where it consumes about 12.4 GB of memory:
files.zip

hiddeco (Member) commented Jun 29, 2023

Please share the output of flux stats and flux version.

Generally speaking, it is recommended to update to the newest versions, even if they are release candidates, as documented in our release notes.

In addition, I can see that a lot of your memory usage, @akesser, is coming from parsing Helm repository indexes. It may be worth looking at https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching to see if this prevents the issue from happening.

akesser commented Jul 17, 2023

We updated to the newest version of Flux. With the old version the error occurred every 24 hours; now it occurs every 5 to 10 minutes.

And we do use Helm repository caching:

--helm-cache-max-size=2000
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
--helm-index-max-size=64428800
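These flags are container arguments on the source-controller Deployment; a sketch for verifying what a running controller was actually started with (assumes the default flux-system namespace):

```shell
# Print the arguments of the first container (conventionally named "manager")
kubectl -n flux-system get deployment source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```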

akesser commented Jul 17, 2023

Here is the output of flux stats:

RECONCILERS          	RUNNING	FAILING	SUSPENDED	STORAGE   
GitRepository        	4      	0      	0        	908.9 KiB	
OCIRepository        	1      	0      	0        	498 B    	
HelmRepository       	30     	0      	0        	81.0 MiB 	
HelmChart            	221    	2      	0        	5.3 MiB  	
Bucket               	0      	0      	0        	-        	
Kustomization        	53     	7      	0        	-        	
HelmRelease          	87     	5      	0        	-        	
Alert                	0      	0      	0        	-        	
Provider             	0      	0      	0        	-        	
Receiver             	0      	0      	0        	-        	
ImageUpdateAutomation	0      	0      	0        	-        	
ImagePolicy          	0      	0      	0        	-        	
ImageRepository      	0      	0      	0        	-  

The old version consumed about 5 GB of RAM and 0.4 cores; the following image shows CPU and RAM usage of the newer version of Flux:

[screenshot: CPU and RAM usage, 2023-07-17 23:19]

It consumes 83 GB of RAM and 17.8 cores.

akesser commented Jul 18, 2023

And here is a heap dump:
heap2.out.zip

akesser commented Jul 18, 2023

Here is another screenshot showing the pod consuming 44.6 cores and 144 GB of RAM:

[screenshot: CPU and RAM usage, 2023-07-18 06:52]

akesser commented Jul 18, 2023

One entry for gotk_cache_events_total that catches the eye is this one:

[screenshot: gotk_cache_events_total value, 2023-07-18 10:16]

All other entries are below 500. This is the graph of the counter:

[screenshot: gotk_cache_events_total over time, 2023-07-18 10:18]
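For others who want to check the same counter, it can be read straight off the controller's Prometheus metrics endpoint (a sketch, assuming the default --metrics-addr port 8080 and the flux-system namespace):

```shell
# Forward the metrics port and filter for the cache counter
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
curl -s http://localhost:8080/metrics | grep gotk_cache_events_total
```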
