
OOMKill for period of time after which it magically resolves without changes #1110

Open
kalmarz opened this issue May 31, 2023 · 29 comments
Labels: area/git (Git related issues and pull requests), blocked/needs-validation (Requires wider review and validation)

Comments

kalmarz commented May 31, 2023

source-controller just got OOMKilled out of the blue. It started about an hour ago and I can't see any prior activity that triggered it. No recently added sources, nothing. It started with rc3; I upgraded to rc4, but the behaviour is the same. Memory usage goes through the roof and the cluster kills the pod.

❯❯❯ flux stats 
RECONCILERS             RUNNING FAILING SUSPENDED       STORAGE
GitRepository           7       0       0               1.9 MiB
OCIRepository           0       0       0               -
HelmRepository          0       0       0               -
HelmChart               0       0       0               -
Bucket                  0       0       0               -
Kustomization           3       0       0               -
HelmRelease             0       0       0               -
Alert                   0       0       0               -
Provider                0       0       0               -
Receiver                0       0       0               -
ImageUpdateAutomation   6       0       0               -
ImagePolicy             20      0       0               -
ImageRepository         20      0       0               -
❯❯❯ flux check
► checking prerequisites
✔ Kubernetes 1.22.17-eks-0a21954 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.34.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.34.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.28.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.0.0-rc.4
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.0.0-rc.4

flux check gets stuck at this point because source-controller is not responding.
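The OOMKill itself can be confirmed from the pod status before digging further. A minimal sketch, assuming the default flux-system namespace and the standard app=source-controller label:

```shell
# Show restart counts for the source-controller pod(s)
kubectl -n flux-system get pods -l app=source-controller

# The previous container's termination reason should read OOMKilled (exit code 137)
kubectl -n flux-system describe pod -l app=source-controller | grep -A 5 "Last State"
```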

stefanprodan added the bug (Something isn't working) and area/git (Git related issues and pull requests) labels on May 31, 2023
hiddeco (Member) commented May 31, 2023

Does the time it takes to end up in an OOM kill allow you to collect a heap profile? Instructions can be found here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
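For convenience, a sketch of the profile collection described in that guide, assuming the pprof endpoints are served on the controller's default metrics port (8080) in the flux-system namespace; if your deployment overrides --metrics-addr, adjust the port and follow the linked guide instead:

```shell
# Forward the controller's metrics/pprof port locally
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &

# Take a heap snapshot while memory usage is climbing
curl -Sk http://localhost:8080/debug/pprof/heap > heap.out

# Inspect the snapshot locally with the Go tooling
go tool pprof -top heap.out
```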

kalmarz (Author) commented May 31, 2023

heap.out.zip

I tried to catch it right before the kill.

cwrau commented May 31, 2023

Same thing is happening with some of our clusters as well

hiddeco (Member) commented May 31, 2023

@cwrau did this start with RC.3 as well?

In addition, did you run RC.2 or RC.1 before without issues?

Based on the heap profile shared, I can't tell what is happening, as it appears to have been taken before the actual issue occurs. What may help is temporarily increasing the memory limits so you can take a proper snapshot while the problem happens, without the Pod getting killed.
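One way to temporarily raise the limit in place so a snapshot can be taken before the kill (a sketch: "manager" is the Flux default container name and the 2Gi value is illustrative; note that the flux-system Kustomization may revert a manual patch on its next reconciliation, in which case the change should instead go into your bootstrap kustomization as a patch):

```shell
# Bump source-controller's memory limit in place
kubectl -n flux-system patch deployment source-controller --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"manager","resources":{"limits":{"memory":"2Gi"}}}]}}}}'
```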

cwrau commented May 31, 2023

> @cwrau did this start with RC.3 as well?
>
> In addition, did you run RC.2 or RC.1 before without issues?
>
> Based on the heap profile shared, I can't tell what is happening, as it appears to have been taken before the actual issue occurs. What may help is temporarily increasing the memory limits so you can take a proper snapshot while the problem happens, without the Pod getting killed.

No, this seems to be kinda version-universal; we have old clusters with source-controller 0.18.0 which also got OOM'd.

hiddeco (Member) commented May 31, 2023

Is there any chance you are all making use of BitBucket and things have now returned to normal?

stefanprodan changed the title from "source-controller gets OOMKilled" to "BitBucket: source-controller gets OOMKilled" on May 31, 2023
cwrau commented May 31, 2023

> Is there any chance you are all making use of BitBucket and things have now returned to normal?

No BitBucket for us, as far as I can see.

HelmRepositories:

https://charts.jetstack.io
https://kubernetes.github.io/ingress-nginx
https://kyverno.github.io/kyverno/
https://charts.bitnami.com/bitnami
https://helm-charts.bitpoke.io
https://opensource.zalando.com/postgres-operator/charts/postgres-operator/
https://prometheus-community.github.io/helm-charts
https://vmware-tanzu.github.io/helm-charts/
https://grafana.github.io/helm-charts
https://aquasecurity.github.io/helm-charts/

GitRepositories:

ssh://git@gitlab.REDACTED
ssh://git@gitlab.REDACTED

(self-hosted GitLab)

But yes, it seems to have stopped

hiddeco (Member) commented May 31, 2023

Any chance your crashing instances are hosted on AWS EKS (AMD64, EC2, managed node groups)? I am trying to find a correlation here between your setups, as the issue seems to have started for a group of users at the same time, with a range of controller versions, and then magically stopped.

ghost commented May 31, 2023

Hello @hiddeco,

I'm with the same organization as @cwrau. No, we don't use AWS; we run our own infrastructure based on OpenStack. We are also trying to find correlations, especially since this came and went out of nowhere.

Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.

hiddeco changed the title from "BitBucket: source-controller gets OOMKilled" to "OOMKill for period of time after which it magically resolves without changes" on May 31, 2023
kalmarz (Author) commented May 31, 2023

I see a bump in DNS traffic for this period.

[image: DNS traffic graph]

hiddeco (Member) commented May 31, 2023

Without this happening again, and a proper heap snapshot taken while it happens, I fear this will be very much like looking for a needle in a haystack.

cwrau commented May 31, 2023

> Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.

The clusters I know of just don't have a memory limit 😉

hiddeco added the blocked/needs-validation (Requires wider review and validation) label and removed the bug (Something isn't working) label on May 31, 2023
ghost commented May 31, 2023

> > Interestingly, a bunch of clusters using source-controller 1.0.0-rc3 didn't have the issue.
>
> The clusters I know of just don't have a memory limit 😉

Oh, you're right. They all don't. Oops!

cwrau commented May 31, 2023

> I see a bump in DNS traffic for this period.
>
> [image: DNS traffic graph]

The same thing happened on at least one of our clusters:

https://snapshots.raintank.io/dashboard/snapshot/6QEyFh33cs2tH6FYj8enOeY0rlbY7jvc

hiddeco (Member) commented May 31, 2023

Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects?

cwrau commented May 31, 2023

> Wouldn't this just be explained by the rapid pod restarts themselves, causing a burst in terms of queueing objects?

Yeah, I was thinking the same thing, but in my snapshot you can see that the responses take longer and that the request and response sizes also got bigger. Maybe that has something to do with this?

hiddeco (Member) commented May 31, 2023

It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery.

Any chance you collect logs somewhere that survive pod restarts?
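If no external log aggregation is in place, a minimal fallback right after a restart is pulling the previous container's logs and the namespace events (a sketch, assuming the default flux-system namespace):

```shell
# Logs from the container instance that was OOMKilled (available until the pod restarts again or is rescheduled)
kubectl -n flux-system logs deploy/source-controller --previous

# Recent namespace events, which include OOMKilled/BackOff reasons
kubectl -n flux-system get events --sort-by=.lastTimestamp
```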

cwrau commented May 31, 2023

> It's a signal, but hard to tell how that would result in such spurious resource usage. Not to mention that the correlation in time (while running in entirely different environments) continues to be a mystery.
>
> Any chance you collect logs somewhere that survive pod restarts?

Of course: https://snapshots.raintank.io/dashboard/snapshot/o6vLv6rjgym3qTdZs2AH6c0m3CjQrIyl

Better dashboard: https://snapshots.raintank.io/dashboard/snapshot/WinIZkonEwYwQaYoh8fQ1NtW2NK2yWxp

hiddeco (Member) commented May 31, 2023

Based on the following log lines (and others like them), it seems that the whole cluster network was generally unstable at the time.

{"level":"error","ts":"2023-05-31T11:57:14.737Z","logger":"runtime","msg":"Failed to release lock: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:14.706Z","logger":"runtime","msg":"failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition\n"}
{"level":"info","ts":"2023-05-31T11:57:14.703Z","msg":"Wait completed, proceeding to shutdown the manager"}
{"level":"error","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta2.OCIRepository: failed to list *v1beta2.OCIRepository: Unauthorized\n"}
{"level":"info","ts":"2023-05-31T11:57:12.809Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: failed to list *v1beta2.OCIRepository: Unauthorized\n"}	
{"level":"error","ts":"2023-05-31T11:57:12.788Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.3/tools/cache/reflector.go:169: Failed to watch *v1beta2.Bucket: failed to list *v1beta2.Bucket: Unauthorized\n"}	
{"level":"error","ts":"2023-05-31T11:57:10.671Z","logger":"runtime","msg":"Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:\"\", APIVersion:\"\"}, ObjectMeta:v1.ObjectMeta{Name:\"grafana.17643873d2342f01\", GenerateName:\"\", Namespace:\"flux-system\", SelfLink:\"\", UID:\"\", ResourceVersion:\"\", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:\"HelmRepository\", Namespace:\"flux-system\", Name:\"grafana\", UID:\"8d83e45a-9757-4b0d-8452-53a4d9404aa1\", APIVersion:\"[source.toolkit.fluxcd.io/v1beta2](http://source.toolkit.fluxcd.io/v1beta2)\", ResourceVersion:\"62460920\", FieldPath:\"\"}, Reason:\"GarbageCollectionSucceeded\", Message:\"garbage collected artifacts for deleted resource\", Source:v1.EventSource{Component:\"source-controller\", Host:\"\"}, FirstTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), LastTimestamp:time.Date(2023, time.May, 31, 11, 57, 10, 642437889, time.Local), Count:1, Type:\"Normal\", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:\"\", Related:(*v1.ObjectReference)(nil), ReportingController:\"\", ReportingInstance:\"\"}': 'events \"grafana.17643873d2342f01\" is forbidden: unable to create new content in namespace flux-system because it is being terminated' (will not retry!)\n"}

This by itself does not explain the OOMKill, but it may lead to pointers elsewhere (CoreDNS, the Kubernetes API server?) that explain how we can replicate this.
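For anyone following up on those pointers, a quick sanity check of both from the affected cluster (a sketch; the k8s-app=kube-dns selector is the common CoreDNS label, but some distributions label it differently):

```shell
# Recent CoreDNS logs
kubectl -n kube-system logs -l k8s-app=kube-dns --since=1h --tail=200

# API server readiness, including individual check results
kubectl get --raw='/readyz?verbose'
```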

cwrau commented May 31, 2023

Hm, it just occurred to me that that cluster might not be the best example, so let's instead take a look at this one:

CoreDNS: https://snapshots.raintank.io/dashboard/snapshot/v7J2X7eC7xHsSXGRdRjJ3pXiTe6kja4r
Logs: https://snapshots.raintank.io/dashboard/snapshot/eTRez5Xo65uZEaMwtoJ1Qh7MqwvSBUyv
Resources: https://snapshots.raintank.io/dashboard/snapshot/ZacWHodjq5khruJB8V4kVxxJA7lnll3w

Sorry 😅

haarchri commented

We see the same on our 15+ clusters in AWS EKS (Karpenter nodes, OOMKilled with exit code 137). We are running the latest version before the RC releases.

What would help to find the problem? A heap dump?

akesser commented Jun 27, 2023

Here are two heap profiles of the same pod, taken about 3h30min apart:
heap.zip

Memory consumption increased from about 290 MB to 2240 MB over that time.
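Since both profiles come from the same process, pprof can diff them directly, which makes the growth between the snapshots easy to attribute (a sketch; the file names stand in for the earlier and later profile in the archive):

```shell
# Show which allocation sites grew between the two snapshots
go tool pprof -top -base heap-earlier.out heap-later.out

# Or browse the diff interactively in a local web UI
go tool pprof -http=:8081 -base heap-earlier.out heap-later.out
```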

akesser commented Jun 28, 2023

Here is a heap profile from the fiftieth incarnation of the pod in a crash loop, where it consumes about 12.4 GB of memory:
files.zip

hiddeco (Member) commented Jun 29, 2023

Please share the output of flux stats and flux version.

Generally speaking, it is recommended to update to the newest versions, even if they are release candidates, as documented in our release notes.

In addition, I can see that a lot of your memory usage, @akesser, is coming from parsing Helm repository indexes. It may be worth looking at https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-repositories-caching to see if this prevents the issue from happening.

akesser commented Jul 17, 2023

We updated to the newest version of Flux. With the old version the error occurred every 24 hours; now it occurs every 5 to 10 minutes.

And we do use Helm repository caching:

--helm-cache-max-size=2000
--helm-cache-ttl=60m
--helm-cache-purge-interval=5m
--helm-index-max-size=64428800
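These flags are container arguments on the source-controller Deployment; a sketch for verifying what a running controller was actually started with (assumes the default flux-system namespace):

```shell
# Print the arguments of the first container (conventionally named "manager")
kubectl -n flux-system get deployment source-controller \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```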

akesser commented Jul 17, 2023

Here is the output of flux stats:

RECONCILERS          	RUNNING	FAILING	SUSPENDED	STORAGE   
GitRepository        	4      	0      	0        	908.9 KiB	
OCIRepository        	1      	0      	0        	498 B    	
HelmRepository       	30     	0      	0        	81.0 MiB 	
HelmChart            	221    	2      	0        	5.3 MiB  	
Bucket               	0      	0      	0        	-        	
Kustomization        	53     	7      	0        	-        	
HelmRelease          	87     	5      	0        	-        	
Alert                	0      	0      	0        	-        	
Provider             	0      	0      	0        	-        	
Receiver             	0      	0      	0        	-        	
ImageUpdateAutomation	0      	0      	0        	-        	
ImagePolicy          	0      	0      	0        	-        	
ImageRepository      	0      	0      	0        	-  

The old version consumed about 5 GB of RAM and 0.4 cores; the following image shows CPU and RAM usage of the newer version of Flux:

[screenshot: CPU and RAM usage, 2023-07-17 23:19]

It consumes 83 GB of RAM and 17.8 cores.

akesser commented Jul 18, 2023

And here is a heap dump:
heap2.out.zip

akesser commented Jul 18, 2023

Here is another screenshot showing the pod consuming 44.6 cores and 144 GB of RAM:

[screenshot: CPU and RAM usage, 2023-07-18 06:52]

akesser commented Jul 18, 2023

One entry for gotk_cache_events_total that catches the eye is this one:

[screenshot: gotk_cache_events_total value, 2023-07-18 10:16]

All other entries are below 500. This is the graph of the counter:

[screenshot: gotk_cache_events_total over time, 2023-07-18 10:18]
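For others who want to check the same counter, it can be read straight off the controller's Prometheus metrics endpoint (a sketch, assuming the default --metrics-addr port 8080 and the flux-system namespace):

```shell
# Forward the metrics port and filter for the cache counter
kubectl -n flux-system port-forward deploy/source-controller 8080:8080 &
curl -s http://localhost:8080/metrics | grep gotk_cache_events_total
```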
